Key Takeaway: Scale AI and the Center for AI Safety have launched Humanity’s Last Exam, offering $5,000 for questions hard enough to stump frontier AI systems such as Google Gemini and OpenAI’s o1. With top models acing conventional benchmarks, and human-written training data possibly running out by 2028, researchers need new tests, from the ARC puzzles to crowdsourced exams, that can separate genuine reasoning from memorization.

Imagine a future where artificial intelligence surpasses human intellect, where machines solve problems faster than we can comprehend. That future might be closer than you think, and two of San Francisco’s top AI players want the public’s help to get there. They’ve launched a provocative new initiative, Humanity’s Last Exam, offering $5,000 to anyone who can create a question tough enough to test the limits of AI systems like Google Gemini and OpenAI’s o1.

This isn’t just another tech contest. It’s a bold experiment in assessing just how close we are to creating AI that rivals—or surpasses—human intelligence. Scale AI, a company specializing in curating vast datasets for AI training, teamed up with the Center for AI Safety (CAIS) to issue the challenge. Together, they’re rallying experts worldwide to design a test that can measure the capabilities of artificial general intelligence (AGI).

The Quest for the Ultimate Test

Why go through all this trouble? Leading AI models are already acing conventional exams—outperforming humans in math, law, and logic tests. But here’s the catch: many believe these AIs might not be truly “thinking.” Instead, they could simply be regurgitating information they’ve absorbed from the internet. Given that these models are trained on vast swathes of online data, including the entire content of Wikipedia, it’s possible they already know the answers to the tests we use.

The central issue is data. These machines learn by analyzing immense datasets, shifting the paradigm from traditional programming to machine learning. Developers feed the AI information and test it on data it hasn’t seen before—known as a “test dataset”—to gauge its abilities. But what happens when an AI has absorbed all available human knowledge? Some experts predict that by 2028, AIs will have effectively read everything ever written by humans. At that point, designing meaningful tests becomes an even more pressing challenge.
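To make that concrete, here is a minimal sketch of the held-out evaluation loop the paragraph describes, written with scikit-learn; the digits dataset and logistic-regression model are illustrative assumptions, not anything these companies actually use:

```python
# Minimal sketch of train/test evaluation: the model is fitted only on the
# training split and then scored on a held-out "test dataset" it has never
# seen, so the score estimates generalization rather than memorization.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Hold back 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

print(f"accuracy on unseen data: {model.score(X_test, y_test):.2f}")
```

The worry in the paragraph above is precisely that this safeguard breaks down: once a model has effectively seen everything, there is no genuinely unseen data left to test it on.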

The Looming Problem of “Model Collapse”

Another obstacle on the horizon is the concept of “model collapse.” As AI-generated content floods the internet and gets reused in future training datasets, there’s a growing concern that these systems may begin to deteriorate in quality. Essentially, AIs might start feeding off their own content, leading to diminishing returns in their performance. To combat this, many developers are collecting data from human interactions with AIs, providing fresh material for training and testing.
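The feedback loop is easy to demonstrate in miniature. The toy simulation below is a rough sketch rather than a model of any real training pipeline: it repeatedly fits a simple statistical “model” (a Gaussian) to a dataset and then replaces the dataset with the model’s own samples.

```python
# Toy analogue of "model collapse": fit a model to data, then replace the
# data with the model's own output, and repeat. The "model" here is just
# a Gaussian; because each fit is estimated from a finite sample, the
# estimated spread performs a downward-biased random walk, so detail
# tends to drain away generation after generation.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=50)  # the original "human" data

for generation in range(1, 31):
    mu, sigma = data.mean(), data.std()      # "train" on the current corpus
    data = rng.normal(mu, sigma, size=50)    # next corpus is pure model output
    if generation % 5 == 0:
        print(f"generation {generation:2d}: estimated std = {sigma:.3f}")
```

Real collapse dynamics are more subtle, but the core loop is the same: each generation trains on an imperfect summary of the last.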

Some believe that the next step in AI evolution involves more than just amassing data—it requires real-world experience. Just as humans learn by interacting with their surroundings, AIs may need to become “embodied” to truly understand and adapt. This concept isn’t as futuristic as it sounds; Tesla has been doing this for years with its self-driving cars. Additionally, wearables like Meta’s Ray-Ban smart glasses, equipped with cameras and microphones, could serve as a source of human-centric data to further train AIs.

The Struggle to Define Intelligence

Even with these advancements, one critical question remains: how do we measure true intelligence, especially in machines? For decades, traditional IQ tests have been criticized for their narrow focus, failing to capture the diverse aspects of intelligence—everything from creativity to empathy. The same issue applies to AI testing. While there are well-established benchmarks for tasks like summarizing text or recognizing gestures, these tests tend to measure very specific skills.

Take Stockfish, the world’s strongest chess engine, for example. It comfortably outplays human grandmasters like Magnus Carlsen, yet it is incapable of any other cognitive task, such as language processing. Clearly, excelling at chess doesn’t equate to broader intelligence. The challenge now is to devise tests that reflect a more general, adaptable form of machine intelligence.

A New Kind of Test

One notable attempt to crack this puzzle comes from François Chollet, a Google engineer who created the “Abstraction and Reasoning Corpus” (ARC) in 2019. Unlike traditional AI tests, which often rely on feeding the machine millions of examples, ARC presents simple visual puzzles and asks the AI to deduce the underlying logic with minimal prior information. The goal is to test an AI’s ability to generalize and adapt, rather than just memorize data.
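To give a feel for the format, here is a toy sketch in the spirit of ARC; the grids and candidate rules are invented for illustration and are far simpler than the real corpus. Each task supplies a few input-to-output grid pairs, and a solver must find a rule consistent with all of them:

```python
# ARC-style setup in miniature: grids are tuples of tuples of small ints
# (colors), a task gives a few input->output training pairs, and the
# solver must infer the transformation from those examples alone.

def transpose(g):
    return tuple(zip(*g))

def flip_horizontal(g):
    return tuple(row[::-1] for row in g)

def increment_colors(g):
    return tuple(tuple((c + 1) % 10 for c in row) for row in g)

CANDIDATE_RULES = [transpose, flip_horizontal, increment_colors]

def solve(train_pairs, test_input):
    """Return the first rule's output that is consistent with every pair."""
    for rule in CANDIDATE_RULES:
        if all(rule(x) == y for x, y in train_pairs):
            return rule(test_input)
    return None  # no hypothesis fits: the task defeats this tiny solver

# Invented task: the output is the input flipped left-to-right.
train_pairs = [
    (((1, 0), (0, 2)), ((0, 1), (2, 0))),
    (((3, 3, 0), (0, 1, 2)), ((0, 3, 3), (2, 1, 0))),
]
print(solve(train_pairs, ((5, 0, 7),)))  # -> ((7, 0, 5),)
```

Real ARC tasks are designed to defeat any such hand-enumerated hypothesis space, which is exactly what makes the benchmark hard: the rule must be inferred from two or three examples out of a vast space of possibilities.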

ARC is tough. In fact, no AI has come close to mastering it. The best AIs, including OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet, have scored only around 21% on the ARC leaderboard. By contrast, humans consistently score over 90%. This gap suggests that while AIs excel in task-specific areas, they still have a long way to go in general reasoning.

One controversial result used OpenAI’s GPT-4o to generate thousands of candidate solutions and then select the best one, reaching a score of around 50% on ARC. While this brute-force method raised eyebrows, it’s still nowhere near human-level performance. The ARC challenge remains one of the most credible ways to measure general AI intelligence, and the $600,000 prize for the first system to reach 85% remains unclaimed.
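That generate-and-select strategy is simple to sketch. In the toy version below, random composition of grid operations stands in for the language model, and a majority vote among candidates that reproduce every training pair picks the final answer; the primitives and task are invented for illustration:

```python
# Schematic of generate-then-select: sample many candidate programs, keep
# only those that reproduce every training pair, then answer with the
# survivors' most common prediction on the test input.
import random
from collections import Counter

PRIMITIVES = {
    "transpose": lambda g: tuple(zip(*g)),
    "flip": lambda g: tuple(row[::-1] for row in g),
    "rot180": lambda g: tuple(row[::-1] for row in g[::-1]),
}

def sample_program(rng, max_len=3):
    # Stand-in for an LLM proposing a solution: a random short program.
    return [rng.choice(list(PRIMITIVES)) for _ in range(rng.randint(1, max_len))]

def run(program, grid):
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def best_of_n(train_pairs, test_input, n=2000, seed=0):
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n):
        prog = sample_program(rng)
        if all(run(prog, x) == y for x, y in train_pairs):
            votes[run(prog, test_input)] += 1
    return votes.most_common(1)[0][0] if votes else None

# Invented task: the rule is a 180-degree rotation.
train_pairs = [(((1, 2), (3, 4)), ((4, 3), (2, 1)))]
print(best_of_n(train_pairs, ((5, 6), (7, 8))))  # -> ((8, 7), (6, 5))
```

The criticism of this approach is visible even at toy scale: nothing here reasons about the puzzle; it simply spends compute on guesses until one happens to fit.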

The Search Continues

Despite ARC’s credibility, there’s a growing need for alternative tests. This is where initiatives like Humanity’s Last Exam come into play. By crowdsourcing questions from the public, Scale AI and CAIS hope to uncover new ways to push AI to its limits. Intriguingly, some of these questions may never be made public. To ensure fairness, certain prize-winning questions won’t be published online—preventing future AIs from “peeking at the answers.”

In the end, the stakes are high. We need to know when machines are approaching human-level reasoning, not just for the technological implications, but for the ethical and moral dilemmas that will inevitably follow. Once we cross that line, we’ll be faced with an even greater challenge: how to assess and manage superintelligence. It’s a mind-boggling task, but one we need to solve before the machines surpass us entirely.

The race to test AI is on—and the ultimate exam may still be ahead.
