The challenge posed by two of San Francisco’s most influential AI players has sparked global intrigue: could the public devise questions that truly measure the capabilities of cutting-edge large language models (LLMs) like Google Gemini and OpenAI’s o1? In collaboration with the Center for AI Safety (CAIS), Scale AI launched the initiative provocatively dubbed Humanity’s Last Exam. The stakes are high, with $5,000 prizes for the top 50 questions, but the implications go far beyond financial incentives.

The goal is as audacious as it sounds: to determine how close we are to achieving AI systems that rival human expertise. Behind this initiative lies a deeper question—how do we evaluate intelligence in machines that can already pass established tests in mathematics, law, and reasoning with ease? The task is complicated by the immense training datasets that fuel these systems, enabling them to pre-learn answers from the vast archives of human knowledge available online.

The Data Dilemma

At the heart of AI’s evolution lies data, a force that has redefined computing itself. Unlike traditional systems that require explicit instructions, AI learns by extracting patterns from extensive datasets. But evaluating these systems requires something more nuanced: test datasets, kept entirely separate from training, that can assess whether the AI genuinely understands rather than merely recalls.

Experts warn that this challenge is becoming more urgent. Predictions suggest that by 2028, AI systems could effectively digest the entirety of humanity’s written output. With this horizon approaching, the question is no longer just about what AIs know but how to assess their abilities when their training encompasses nearly everything ever created.

Adding to the complexity is the phenomenon of “model collapse.” As AI-generated content increasingly populates the internet, future systems risk being trained on recycled material, potentially leading to declining performance. To counteract this, many developers are already turning to human interactions, using real-world exchanges to generate fresh training and testing data.
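
A toy simulation makes this mechanism easier to see. In the sketch below, each “generation” of a model is fitted only to data produced by the previous generation, and the model never reproduces its rarest examples; the diversity of the data then shrinks step by step. The one-dimensional distribution and the 1.5-standard-deviation cutoff are illustrative assumptions, not parameters from any real system.

```python
# Toy sketch of "model collapse": each generation is trained only on
# synthetic data sampled from the previous generation's model, and the
# model fails to reproduce its rarest (tail) examples, so the spread of
# the data shrinks generation after generation.
import random
import statistics

random.seed(0)

# Generation 0: "human-written" data with a wide spread of values.
data = [random.gauss(0.0, 10.0) for _ in range(2000)]

for generation in range(6):
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    print(f"generation {generation}: diversity (std) = {sigma:.2f}")

    # The next generation learns only from synthetic samples drawn from the
    # current model, which never reproduces anything further than 1.5
    # standard deviations from the mean (the rare, unusual material).
    data = [
        x for x in (random.gauss(mu, sigma) for _ in range(4000))
        if abs(x - mu) <= 1.5 * sigma
    ][:2000]
```

Each pass narrows the distribution, mirroring the worry that models trained on recycled AI output gradually lose the rare, unusual material that made the original human data valuable.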

AI in the Real World

The future of testing AI might lie beyond the digital realm. Some researchers argue that intelligence can only be measured when machines experience the world as humans do. This vision of “embodied AI” is already being explored, with companies like Tesla using autonomous vehicles to collect real-world data. Other opportunities may come from human wearables, such as Meta’s smart glasses, equipped with cameras and microphones to record human-centric experiences.

These innovations could provide new avenues for AI learning, but they also underscore a deeper challenge—defining and measuring intelligence itself. For humans, intelligence encompasses diverse abilities, from problem-solving to emotional understanding. Traditional IQ tests have long faced criticism for their narrow scope, and similar limitations plague current AI benchmarks.

Narrow Measures, Broader Questions

Most existing tests for AI focus on specific tasks like text summarisation, visual recognition, or gesture interpretation. While these tests have been instrumental, their narrow focus limits their ability to measure broader intelligence. Take, for instance, the chess-playing AI Stockfish. It outperforms even Magnus Carlsen, the highest-rated human chess player in history, but its mastery of chess doesn’t extend to other forms of reasoning or understanding.

As AIs demonstrate more versatile capabilities, the need for better benchmarks grows. French Google engineer François Chollet introduced one such test: the Abstraction and Reasoning Corpus (ARC). Unlike traditional benchmarks, ARC challenges AIs to infer abstract rules from a handful of examples and apply them to solve new puzzles. The test provides minimal prior data, forcing the system to reason flexibly rather than fall back on pre-learned solutions.

ARC’s puzzles, presented as simple visual grids, are easy for humans but pose significant hurdles for AIs. Current models like OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet score only around 21% on the ARC leaderboard, far below the human average of over 90%. Even a recent attempt using GPT-4o, which employed a brute-force approach of generating thousands of candidate answers, reached only 50%, still well short of the $600,000 prize for achieving 85%.
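
To make the format concrete, the sketch below shows, under illustrative assumptions, roughly what an ARC-style task looks like and how a brute-force search over candidate transformations might proceed. The grids, the candidate rules, and the solve helper are invented for this example; they are not drawn from the real ARC dataset, and this is not the method used in the GPT-4o attempt.

```python
# An ARC-style task supplies a few input/output grid pairs; the solver must
# infer the hidden transformation and apply it to a new test grid.
# Everything below (grids, candidate rules) is invented for illustration.

Grid = list[list[int]]

# Toy task: the hidden rule is a horizontal flip of each row.
train_pairs: list[tuple[Grid, Grid]] = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5, 0], [0, 4, 4]], [[0, 5, 5], [4, 4, 0]]),
]
test_input: Grid = [[7, 0, 0], [0, 7, 7]]


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_vertical(g: Grid) -> Grid:
    return [row[:] for row in reversed(g)]


def rotate_90(g: Grid) -> Grid:
    return [list(row) for row in zip(*g[::-1])]


# A tiny library of candidate transformations: a stand-in for the thousands
# of generated programs a brute-force approach would sample.
CANDIDATES = [identity, flip_horizontal, flip_vertical, rotate_90]


def solve(train, test):
    # Return the first candidate consistent with every training pair,
    # applied to the test input.
    for rule in CANDIDATES:
        if all(rule(inp) == out for inp, out in train):
            return rule.__name__, rule(test)
    return None


print(solve(train_pairs, test_input))
# ('flip_horizontal', [[0, 0, 7], [7, 7, 0]])
```

The real benchmark is far harder: the hidden rule can involve symmetry, counting, object manipulation, or colour logic, so no fixed library of candidate transformations covers the space, which is why even brute-force program generation with GPT-4o topped out at around 50%.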

The Quest for Meaningful Tests

The ARC framework stands out as one of the most credible attempts to measure AI intelligence, but it’s not the only effort. The Scale/CAIS initiative aims to uncover novel ways to probe AI reasoning. Interestingly, the winning questions from Humanity’s Last Exam may remain unpublished to prevent AIs from accessing them during training—a clever safeguard against gaming the system.

The search for effective AI tests is not just academic; it’s a vital step toward understanding when machines approach human-level reasoning. This raises profound ethical and safety considerations. How do we prepare for a world where machines can think, reason, and act with the same—or greater—capabilities as humans?

The Next Frontier

Beyond assessing human-level AI, an even greater challenge looms on the horizon: testing for superintelligence. If machines surpass human cognition, how will we measure their understanding? The task is daunting, with implications for governance, safety, and control. As humanity crafts its questions for Humanity’s Last Exam, it also grapples with an existential query: what happens when the test taker outpaces the examiner?

In this unfolding story of AI, each breakthrough brings us closer to answers but also deeper into uncharted territory. The exam is not just a test of machines; it’s a reflection of humanity’s own readiness for the future it is creating.
