Key Takeaway:
Humanity’s Last Exam is an initiative from Scale AI and the Center for AI Safety (CAIS) to test how close AI systems are to artificial general intelligence (AGI). More than a tech contest, it is an experiment in measuring AI progress. Leading models already outperform humans on math, law, and logic tests, but many suspect they are simply regurgitating information absorbed from the internet. The central issue is data: AI models learn by analyzing immense datasets, and fresh material to test them on is running out. A further worry is “model collapse,” in which systems retrained on AI-generated content gradually deteriorate. To counter this, developers are collecting data from human interactions with AIs, and some argue AIs will need to become “embodied” to truly understand and adapt. The open challenge is to devise tests that capture a more general, adaptable form of machine intelligence.
Imagine a future where artificial intelligence surpasses human intellect, where machines solve problems faster than we can comprehend. That future might be closer than you think, and two of San Francisco’s top AI players want the public’s help to get there. They’ve launched a provocative new initiative, Humanity’s Last Exam, offering $5,000 to anyone who can create a question tough enough to test the limits of AI systems like Google Gemini and OpenAI’s o1.
This isn’t just another tech contest. It’s a bold experiment in assessing just how close we are to creating AI that rivals—or surpasses—human intelligence. Scale AI, a company specializing in curating vast datasets for AI training, teamed up with the Center for AI Safety (CAIS) to issue the challenge. Together, they’re rallying experts worldwide to design a test that can measure the capabilities of artificial general intelligence (AGI).
The Quest for the Ultimate Test
Why go through all this trouble? Leading AI models are already acing conventional exams—outperforming humans in math, law, and logic tests. But here’s the catch: many believe these AIs might not be truly “thinking.” Instead, they could simply be regurgitating information they’ve absorbed from the internet. Given that these models are trained on vast swathes of online data, including the entire content of Wikipedia, it’s possible they already know the answers to the tests we use.
The central issue is data. These machines learn by analyzing immense datasets, shifting the paradigm from traditional programming to machine learning. Developers feed the AI information and test it on data it hasn’t seen before—known as a “test dataset”—to gauge its abilities. But what happens when an AI has absorbed all available human knowledge? Some experts predict that by 2028, AIs will have effectively read everything ever written by humans. At that point, designing meaningful tests becomes an even more pressing challenge.
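In code, the idea looks roughly like this minimal sketch (assuming scikit-learn and a toy dataset purely for illustration; the article names no specific tools): the model is scored only on examples it never saw during training.

```python
# Minimal sketch of the train/test-dataset idea (illustrative only).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Hold back 20% of the examples; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Performance on the held-out "test dataset" is what gauges its abilities;
# a model that had already seen these examples would post an inflated score.
print(f"Accuracy on unseen data: {model.score(X_test, y_test):.2f}")
```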
The Looming Problem of “Model Collapse”
Another obstacle on the horizon is the concept of “model collapse.” As AI-generated content floods the internet and gets reused in future training datasets, there’s a growing concern that these systems may begin to deteriorate in quality. Essentially, AIs might start feeding off their own content, leading to diminishing returns in their performance. To combat this, many developers are collecting data from human interactions with AIs, providing fresh material for training and testing.
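The dynamic behind that worry can be shown with a toy simulation (a deliberate oversimplification, not a model of any real training pipeline): a word-frequency model is repeatedly refit on its own generated output, and the long tail of rare words steadily erodes.

```python
# Toy illustration of model collapse: retraining on self-generated data
# erodes the long tail. Purely illustrative; no real system works this simply.
import numpy as np

rng = np.random.default_rng(42)
vocab_size = 1_000

# Generation 0: "human" text with Zipf-like word frequencies (a long tail).
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()
corpus = rng.choice(vocab_size, size=20_000, p=probs)

for generation in range(1, 6):
    # "Train" on the current corpus by estimating word frequencies from it...
    counts = np.bincount(corpus, minlength=vocab_size)
    probs = counts / counts.sum()
    # ...then "generate" the next corpus from that fit. A word that was never
    # sampled gets zero probability and can never reappear, so diversity shrinks.
    corpus = rng.choice(vocab_size, size=20_000, p=probs)
    print(f"Generation {generation}: {len(np.unique(corpus))} distinct words remain")
```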
Some believe that the next step in AI evolution involves more than just amassing data—it requires real-world experience. Just as humans learn by interacting with their surroundings, AIs may need to become “embodied” to truly understand and adapt. This concept isn’t as futuristic as it sounds; Tesla has been doing this for years with its self-driving cars. Additionally, wearables like Meta’s Ray-Ban smart glasses, equipped with cameras and microphones, could serve as a source of human-centric data to further train AIs.
The Struggle to Define Intelligence
Even with these advancements, one critical question remains: how do we measure true intelligence, especially in machines? For decades, traditional IQ tests have been criticized for their narrow focus, failing to capture the diverse aspects of intelligence—everything from creativity to empathy. The same issue applies to AI testing. While there are well-established benchmarks for tasks like summarizing text or recognizing gestures, these tests tend to measure very specific skills.
Take Stockfish, the world’s top chess-playing AI, for example. It dominates human grandmasters like Magnus Carlsen but is completely incapable of other cognitive tasks like language processing. Clearly, excelling at chess doesn’t equate to broader intelligence. The challenge now is to devise tests that reflect a more general, adaptable form of AI intelligence.
A New Kind of Test
One notable attempt to crack this puzzle comes from François Chollet, a Google engineer who created the Abstraction and Reasoning Corpus (ARC) in 2019. Unlike traditional AI tests, which often rely on feeding the machine millions of examples, ARC presents simple visual puzzles and asks the AI to deduce the underlying logic with minimal prior information. The goal is to test an AI’s ability to generalize and adapt, rather than just memorize data.
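A toy example in that spirit shows the format (the grids and the candidate rule below are invented for illustration and are far simpler than real ARC tasks): a few input/output demonstrations, a hypothesis about the hidden rule, and an unseen test grid to answer.

```python
# Invented ARC-style toy task: infer the hidden transformation from two demos.
import numpy as np

# Demonstration pairs: in each, the output is the input mirrored left-to-right.
demos = [
    (np.array([[1, 0], [2, 0]]), np.array([[0, 1], [0, 2]])),
    (np.array([[3, 3, 0], [0, 1, 0]]), np.array([[0, 3, 3], [0, 1, 0]])),
]
test_input = np.array([[5, 0, 0], [0, 0, 7]])

def candidate_rule(grid: np.ndarray) -> np.ndarray:
    """One hypothesis about the hidden rule: flip the grid horizontally."""
    return grid[:, ::-1]

# A solver is judged on whether its hypothesis reproduces every demonstration...
assert all(np.array_equal(candidate_rule(x), y) for x, y in demos)

# ...and on the answer it then produces for the unseen test grid.
print(candidate_rule(test_input))
```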
ARC is tough. In fact, no AI has come close to mastering it. The best AIs, including OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet, have only scored around 21% on the ARC leaderboard. By contrast, humans consistently score over 90%. This gap suggests that while AIs excel at task-specific work, they still have a long way to go in general reasoning.
One controversial breakthrough involved OpenAI’s GPT-4o, which achieved a 50% score on ARC by generating thousands of possible solutions before selecting the best one. While this method raised eyebrows, it’s still nowhere near human-level performance. The ARC challenge remains one of the most credible ways to measure true AI intelligence, and the $600,000 prize for the first system to reach 85% remains unclaimed.
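That strategy reads roughly like the sketch below (a simplified stand-in: the reported attempt sampled thousands of candidate programs from the model, whereas here the candidate pool is a short hand-written list so the example stays runnable): generate many hypotheses, keep only those consistent with the demonstrations, and answer with a survivor.

```python
# Sketch of the generate-many-then-select strategy (illustrative stand-in:
# a handful of hand-written candidates instead of thousands of sampled programs).
import numpy as np

# Invented ARC-style task: each output is the input rotated 90 degrees clockwise.
demos = [
    (np.array([[1, 2], [3, 4]]), np.array([[3, 1], [4, 2]])),
    (np.array([[0, 5, 0], [0, 0, 6]]), np.array([[0, 0], [0, 5], [6, 0]])),
]
test_input = np.array([[7, 0], [0, 8]])

# Stand-in for a large pool of sampled candidate solutions.
candidates = {
    "identity": lambda g: g,
    "flip_horizontal": lambda g: g[:, ::-1],
    "rotate_clockwise": lambda g: np.rot90(g, k=-1),
    "rotate_counterclockwise": lambda g: np.rot90(g),
}

# Selection step: keep only candidates that reproduce every demonstration.
survivors = {
    name: fn
    for name, fn in candidates.items()
    if all(np.array_equal(fn(x), y) for x, y in demos)
}

print("Consistent candidates:", list(survivors))
for name, fn in survivors.items():
    print(f"{name} answers the test grid with:\n{fn(test_input)}")
```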
The Search Continues
Despite ARC’s credibility, there’s a growing need for alternative tests. This is where initiatives like Humanity’s Last Exam come into play. By crowdsourcing questions from the public, Scale AI and CAIS hope to uncover new ways to push AI to its limits. Intriguingly, some of these questions may never be made public. To ensure fairness, certain prize-winning questions won’t be published online—preventing future AIs from “peeking at the answers.”
In the end, the stakes are high. We need to know when machines are approaching human-level reasoning, not just for the technological implications, but for the ethical and moral dilemmas that will inevitably follow. Once we cross that line, we’ll be faced with an even greater challenge: how to assess and manage superintelligence. It’s a mind-boggling task, but one we need to solve before the machines surpass us entirely.
The race to test AI is on—and the ultimate exam may still be ahead.