Key Takeaway:
Scale AI, in collaboration with the Center for AI Safety, has launched Humanity’s Last Exam, an initiative that crowdsources questions to probe the capabilities of cutting-edge large language models (LLMs) such as Google Gemini and OpenAI’s o1. The goal is to determine how close we are to AI systems that rival human expertise. The difficulty is that these models already pass established tests in mathematics, law, and reasoning with ease, and their immense training datasets let them pre-learn answers from the vast archives of human knowledge available online. The future of testing AI may lie beyond the digital realm, in approaches such as embodied AI and human wearables. The search for effective AI tests is not just academic; it also raises ethical and safety considerations.
The challenge posed by two of San Francisco’s most influential AI players has sparked global intrigue: could the public devise questions that truly measure the capabilities of cutting-edge large language models (LLMs) like Google Gemini and OpenAI’s o1? In collaboration with the Center for AI Safety (CAIS), Scale AI launched the initiative provocatively dubbed Humanity’s Last Exam. The stakes are high, with $5,000 prizes for the top 50 questions, but the implications go far beyond financial incentives.
The goal is as audacious as it sounds: to determine how close we are to achieving AI systems that rival human expertise. Behind this initiative lies a deeper question—how do we evaluate intelligence in machines that can already pass established tests in mathematics, law, and reasoning with ease? The task is complicated by the immense training datasets that fuel these systems, enabling them to pre-learn answers from the vast archives of human knowledge available online.
The Data Dilemma
At the heart of AI’s evolution lies data—a force that has redefined computing itself. Unlike traditional systems that require explicit instructions, AI thrives on being shown patterns through extensive datasets. But evaluating these systems requires something more nuanced: test datasets, untouched by the training phase, that can genuinely assess how well the AI understands rather than recalls.
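To make the idea of an “untouched” test set concrete, here is a minimal sketch in Python. The question pool, field names, and exact-match overlap check are illustrative assumptions, not any lab’s actual evaluation pipeline: the point is simply that evaluation items are held out before training, and any test question that already appears verbatim in the training data measures recall rather than understanding.

```python
import random

# Illustrative question pool; real evaluation sets are curated, not random splits.
questions = [
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "Name the capital of Australia.", "answer": "Canberra"},
    {"prompt": "Solve for x in 2x + 6 = 20.", "answer": "7"},
    # ... thousands more in practice
]

random.seed(0)
random.shuffle(questions)

# Hold out a test split *before* any training happens.
split = int(0.8 * len(questions))
train_set, test_set = questions[:split], questions[split:]

# Contamination check: a test item whose prompt already appears verbatim in the
# training data would reward memorisation, so it is discarded.
train_prompts = {q["prompt"] for q in train_set}
clean_test_set = [q for q in test_set if q["prompt"] not in train_prompts]

print(f"{len(test_set) - len(clean_test_set)} contaminated test items removed")
```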
Experts warn that this challenge is becoming more urgent. Predictions suggest that by 2028, AI systems could effectively digest the entirety of humanity’s written output. With this horizon approaching, the question is no longer just about what AIs know but how to assess their abilities when their training encompasses nearly everything ever created.
Adding to the complexity is the phenomenon of “model collapse.” As AI-generated content increasingly populates the internet, future systems risk being trained on recycled material, potentially leading to declining performance. To counteract this, many developers are already turning to human interactions, using real-world exchanges to generate fresh training and testing data.
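The mechanism behind model collapse is easy to see in a toy setting. The sketch below is a deliberately simplified simulation, not a claim about how production LLMs behave: it fits a simple statistical model (a Gaussian) to some data, generates a new “synthetic” dataset from that fit, refits, and repeats. Over many generations the distribution drifts away from the original and its spread collapses, which is the essence of the problem with training on recycled model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data, drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(200):
    # Fit a very simple model: estimate the mean and spread of the current data.
    mu, sigma = data.mean(), data.std()
    if generation % 25 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}  std={sigma:.3f}")
    # The next generation is trained only on the previous model's own output,
    # mimicking a web increasingly filled with AI-generated material.
    data = rng.normal(loc=mu, scale=sigma, size=50)
```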
AI in the Real World
The future of testing AI might lie beyond the digital realm. Some researchers argue that intelligence can only be measured when machines experience the world as humans do. This vision of “embodied AI” is already being explored, with companies like Tesla using autonomous vehicles to collect real-world data. Other opportunities may come from human wearables, such as Meta’s smart glasses, equipped with cameras and microphones to record human-centric experiences.
These innovations could provide new avenues for AI learning, but they also underscore a deeper challenge—defining and measuring intelligence itself. For humans, intelligence encompasses diverse abilities, from problem-solving to emotional understanding. Traditional IQ tests have long faced criticism for their narrow scope, and similar limitations plague current AI benchmarks.
Narrow Measures, Broader Questions
Most existing tests for AI focus on specific tasks like text summarisation, visual recognition, or gesture interpretation. While these tests have been instrumental, their narrow focus limits their ability to measure broader intelligence. Take, for instance, the chess-playing AI Stockfish. It outperforms even Magnus Carlsen, the highest-rated human chess player in history, but its mastery of chess doesn’t extend to other forms of reasoning or understanding.
As AIs demonstrate more versatile capabilities, the need for better benchmarks grows. François Chollet, a French engineer at Google, introduced one such test: the Abstraction and Reasoning Corpus (ARC). Unlike traditional benchmarks, ARC challenges AIs to infer and apply abstract rules to solve puzzles. Each task provides only a handful of examples, forcing the system to think flexibly rather than fall back on pre-learned solutions.
ARC’s puzzles, presented as simple visual grids, are easy for humans but pose significant hurdles for AIs. Current models like OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet score only 21% on the ARC leaderboard, far below the human average of over 90%. Even a recent attempt using GPT-4o, which employed a brute-force approach of generating thousands of possible answers, reached only 50%, still well short of the $600,000 prize for achieving 85%.
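For a sense of what “inferring a rule from a handful of examples” means in practice, here is a toy sketch in the spirit of ARC. The task, grids, and solver below are invented for illustration and are far simpler than real ARC puzzles: given two example input/output grids, the code looks for a single colour-substitution rule that explains both, then applies it to a new test grid.

```python
# A toy ARC-style task: grids are lists of lists of small integers ("colours").
# Real ARC puzzles demand far richer rules (symmetry, counting, moving objects).
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": {"input": [[0, 1], [1, 1]]},
}

def infer_colour_map(pairs):
    """Find a colour substitution consistent with every example pair, if one exists."""
    mapping = {}
    for pair in pairs:
        for in_row, out_row in zip(pair["input"], pair["output"]):
            for a, b in zip(in_row, out_row):
                if mapping.setdefault(a, b) != b:
                    return None  # contradictory examples: no single substitution works
    return mapping

def apply_colour_map(grid, mapping):
    return [[mapping.get(cell, cell) for cell in row] for row in grid]

rule = infer_colour_map(task["train"])
if rule is not None:
    print(apply_colour_map(task["test"]["input"], rule))  # [[0, 2], [2, 2]]
```

A hard-coded solver like this only handles one narrow family of rules; ARC’s difficulty comes from the fact that each puzzle may demand a different, previously unseen rule, so the solver itself has to generalise.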
The Quest for Meaningful Tests
The ARC framework stands out as one of the most credible attempts to measure AI intelligence, but it’s not the only effort. The Scale/CAIS initiative aims to uncover novel ways to probe AI reasoning. Interestingly, the winning questions from Humanity’s Last Exam may remain unpublished to prevent AIs from accessing them during training—a clever safeguard against gaming the system.
The search for effective AI tests is not just academic; it’s a vital step toward understanding when machines approach human-level reasoning. This raises profound ethical and safety considerations. How do we prepare for a world where machines can think, reason, and act with the same—or greater—capabilities as humans?
The Next Frontier
Beyond assessing human-level AI, an even greater challenge looms on the horizon: testing for superintelligence. If machines surpass human cognition, how will we measure their understanding? The task is daunting, with implications for governance, safety, and control. As humanity crafts its questions for Humanity’s Last Exam, it also grapples with an existential query: what happens when the test taker outpaces the examiner?
In this unfolding story of AI, each breakthrough brings us closer to answers but also deeper into uncharted territory. The exam is not just a test of machines; it’s a reflection of humanity’s own readiness for the future it is creating.