Imagine a future where artificial intelligence surpasses human intellect, where machines solve problems faster than we can comprehend. That future might be closer than you think, and two of San Francisco’s top AI players want the public’s help to get there. They’ve launched a provocative new initiative, Humanity’s Last Exam, offering $5,000 to anyone who can create a question tough enough to test the limits of AI systems like Google Gemini and OpenAI’s o1.

This isn’t just another tech contest. It’s a bold experiment in assessing just how close we are to creating AI that rivals—or surpasses—human intelligence. Scale AI, a company specializing in curating vast datasets for AI training, teamed up with the Center for AI Safety (CAIS) to issue the challenge. Together, they’re rallying experts worldwide to design a test that can measure the capabilities of artificial general intelligence (AGI).

The Quest for the Ultimate Test

Why go through all this trouble? Leading AI models are already acing conventional exams—outperforming humans in math, law, and logic tests. But here’s the catch: many believe these AIs might not be truly “thinking.” Instead, they could simply be regurgitating information they’ve absorbed from the internet. Given that these models are trained on vast swathes of online data, including the entire content of Wikipedia, it’s possible they already know the answers to the tests we use.

The central issue is data. These machines learn by analyzing immense datasets, shifting the paradigm from traditional programming to machine learning. Developers feed the AI information and test it on data it hasn’t seen before—known as a “test dataset”—to gauge its abilities. But what happens when an AI has absorbed all available human knowledge? Some experts predict that by 2028, AIs will have effectively read everything ever written by humans. At that point, designing meaningful tests becomes an even more pressing challenge.
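The train/test split described above can be sketched in a few lines. This is a minimal illustration of the holdout idea, not any lab's actual pipeline; the function name and the stand-in data are my own.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle the data, then hold out a slice the model never sees in training."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(100))        # stand-in for 100 labelled examples
train, test = train_test_split(examples)
print(len(train), len(test))       # 80 for training, 20 held out
assert not set(train) & set(test)  # no example appears in both sets
```

The final assertion is the whole point: a score on the held-out slice only measures ability if the model genuinely never saw those examples, which is exactly the guarantee that breaks down once a model has read "everything."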

The Looming Problem of “Model Collapse”

Another obstacle on the horizon is the concept of “model collapse.” As AI-generated content floods the internet and gets reused in future training datasets, there’s a growing concern that these systems may begin to deteriorate in quality. Essentially, AIs might start feeding off their own content, leading to diminishing returns in their performance. To combat this, many developers are collecting data from human interactions with AIs, providing fresh material for training and testing.
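The feedback loop behind model collapse can be made concrete with a toy simulation. This is my own illustration under a strong simplifying assumption: each "generation" fits a normal distribution to the previous generation's output and, like a model favouring high-probability text, keeps only its most typical samples. Rare, tail-of-the-distribution content disappears first.

```python
import random
import statistics

def next_generation(data, rng):
    """Fit a normal to the data, sample from it, and keep only the most
    'typical' outputs -- a crude stand-in for a model favouring its mode."""
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    samples = [rng.gauss(mu, sigma) for _ in range(2 * len(data))]
    samples.sort(key=lambda x: abs(x - mu))  # most probable first
    return samples[:len(data)]               # the tails never make it back in

rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(200)]  # generation 0: "human" data
spreads = []
for _ in range(5):
    data = next_generation(data, rng)
    spreads.append(statistics.stdev(data))

# The spread of the data shrinks generation after generation:
# diversity drains out of the training distribution.
assert spreads[-1] < spreads[0]
print(spreads)
```

Real training pipelines are vastly more complicated, but the qualitative worry is the same: a model trained on its own filtered output sees an ever-narrower slice of the world.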

Some believe that the next step in AI evolution involves more than just amassing data—it requires real-world experience. Just as humans learn by interacting with their surroundings, AIs may need to become “embodied” to truly understand and adapt. This concept isn’t as futuristic as it sounds; Tesla has been doing this for years with its self-driving cars. Additionally, wearables like Meta’s Ray-Ban smart glasses, equipped with cameras and microphones, could serve as a source of human-centric data to further train AIs.

The Struggle to Define Intelligence

Even with these advancements, one critical question remains: how do we measure true intelligence, especially in machines? For decades, traditional IQ tests have been criticized for their narrow focus, failing to capture the diverse aspects of intelligence—everything from creativity to empathy. The same issue applies to AI testing. While there are well-established benchmarks for tasks like summarizing text or recognizing gestures, these tests tend to measure very specific skills.

Take Stockfish, the world’s top chess-playing AI, for example. It dominates human grandmasters like Magnus Carlsen but is completely incapable of other cognitive tasks like language processing. Clearly, excelling at chess doesn’t equate to broader intelligence. The challenge now is to devise tests that reflect a more general, adaptable form of AI intelligence.

A New Kind of Test

One notable attempt to crack this puzzle comes from François Chollet, a Google engineer who created the Abstraction and Reasoning Corpus (ARC) in 2019. Unlike traditional AI tests, which often rely on feeding the machine millions of examples, ARC presents simple visual puzzles and asks the AI to deduce the underlying logic with minimal prior information. The goal is to test an AI’s ability to generalize and adapt, rather than just memorize data.
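The format can be sketched with a toy solver. An ARC-style task supplies a handful of input-to-output grid pairs, and the solver must infer the rule. The approach below, which simply tests a tiny hand-written library of candidate transformations against the demonstrations, is purely illustrative; real ARC tasks are far more open-ended than three fixed rules.

```python
# Each grid is a list of rows; cells are small integers (colours).
def flip_h(grid):
    return [row[::-1] for row in grid]      # mirror left-to-right

def flip_v(grid):
    return grid[::-1]                       # mirror top-to-bottom

def transpose(grid):
    return [list(col) for col in zip(*grid)]

CANDIDATES = [flip_h, flip_v, transpose]

def solve(demonstrations, test_input):
    """Return the first candidate rule consistent with every demonstration,
    applied to the test input -- or None if no rule explains the examples."""
    for rule in CANDIDATES:
        if all(rule(inp) == out for inp, out in demonstrations):
            return rule(test_input)
    return None

demos = [([[1, 0], [0, 2]], [[0, 1], [2, 0]]),   # each output is the input
         ([[3, 3], [0, 4]], [[3, 3], [4, 0]])]   # mirrored left-to-right
print(solve(demos, [[5, 0], [0, 6]]))            # -> [[0, 5], [6, 0]]
```

What makes ARC hard for machines is that the space of plausible rules is effectively unbounded, and each task gives only two or three demonstrations, so there is nothing to memorize.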

ARC is tough. In fact, no AI has come close to mastering it. The best AIs, including OpenAI’s o1 and Anthropic’s Sonnet 3.5, have only scored around 21% on the ARC leaderboard. By contrast, humans consistently score over 90%. This gap suggests that while AIs are excelling in task-specific areas, they still have a long way to go in terms of general reasoning.

One controversial breakthrough involved OpenAI’s GPT-4o, which achieved a 50% score on ARC by generating thousands of possible solutions before selecting the best one. While this method raised eyebrows, it’s still nowhere near human-level performance. The ARC challenge remains one of the most credible ways to measure true AI intelligence, and the $600,000 prize for the first system to reach 85% remains unclaimed.
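The sample-and-filter strategy behind that result can be shown in a stripped-down sketch. Everything here is a stand-in: instead of a language model proposing thousands of programs, a "candidate" is just a random linear function, and the "filter" checks it against the demonstrations. The point is that the heavy lifting is done by the filter over many guesses, not by any single guess.

```python
import random

def sample_candidate(rng):
    """Stand-in for asking a language model for one guess at the rule.
    Here a 'program' is just a pair (a, b) meaning x -> a*x + b."""
    return (rng.randint(-5, 5), rng.randint(-5, 5))

def passes(program, demonstrations):
    a, b = program
    return all(a * x + b == y for x, y in demonstrations)

def best_of_n(demonstrations, n=10_000, seed=0):
    """Sample many candidates; keep the first one consistent with the demos."""
    rng = random.Random(seed)
    for _ in range(n):
        program = sample_candidate(rng)
        if passes(program, demonstrations):
            return program
    return None

demos = [(0, 1), (1, 3), (2, 5)]   # hidden rule: y = 2*x + 1
print(best_of_n(demos))            # -> (2, 1)
```

Critics call this brute force rather than reasoning, which is exactly why the result "raised eyebrows": a system that needs thousands of attempts per puzzle is doing something quite different from a person who sees the rule at a glance.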

The Search Continues

Despite ARC’s credibility, there’s a growing need for alternative tests. This is where initiatives like Humanity’s Last Exam come into play. By crowdsourcing questions from the public, Scale AI and CAIS hope to uncover new ways to push AI to its limits. Intriguingly, some of these questions may never be made public. To ensure fairness, certain prize-winning questions won’t be published online—preventing future AIs from “peeking at the answers.”

In the end, the stakes are high. We need to know when machines are approaching human-level reasoning, not just for the technological implications, but for the ethical and moral dilemmas that will inevitably follow. Once we cross that line, we’ll be faced with an even greater challenge: how to assess and manage superintelligence. It’s a mind-boggling task, but one we need to solve before the machines surpass us entirely.

The race to test AI is on—and the ultimate exam may still be ahead.
