The pace of progress in artificial intelligence for mathematics is outstripping our capacity to design tests that evaluate it. Models like those from Google DeepMind render new benchmarks obsolete within months, and the cycle is accelerating. This poses a problem for science: how do you measure capabilities that improve exponentially? The need for new evaluation methods is clear.
The Benchmark Obsolescence Cycle
Current systems, trained on massive volumes of data and equipped with techniques like chain-of-thought reasoning, quickly master specific problem sets. As soon as a new test is published, the community uses it to train and fine-tune models, which soon surpass it. This process shortens the lifespan of any metric, forcing researchers to seek problems with greater structural complexity, or ones that demand a conceptual leap absent from the training data.
Scientists Ask AI to Evaluate Itself, Please
Faced with this situation, some propose creative solutions. The most popular is to ask the AI itself to generate the exams of the future. It's a flawless plan: we delegate the heavy lifting and then complain that the questions are too easy for it. The next logical step will be for the AI to also grade itself, write the paper, and submit it to a journal, freeing us once and for all from the bother of thinking.