The challenge posed by two of San Francisco’s most influential AI players has sparked global intrigue: could the public devise questions that truly measure the capabilities of cutting-edge large language models (LLMs) like Google Gemini and OpenAI’s o1? In collaboration with the Center for AI Safety (CAIS), Scale AI launched the initiative provocatively dubbed Humanity’s Last Exam. The stakes are high, with $5,000 prizes for the top 50 questions, but the implications go far beyond financial incentives.

The goal is as audacious as it sounds: to determine how close we are to achieving AI systems that rival human expertise. Behind this initiative lies a deeper question—how do we evaluate intelligence in machines that can already pass established tests in mathematics, law, and reasoning with ease? The task is complicated by the immense training datasets that fuel these systems, enabling them to pre-learn answers from the vast archives of human knowledge available online.

The Data Dilemma

At the heart of AI’s evolution lies data—a force that has redefined computing itself. Unlike traditional systems that require explicit instructions, AI thrives on being shown patterns through extensive datasets. But evaluating these systems requires something more nuanced: test datasets, untouched by the training phase, that can genuinely assess how well the AI understands rather than recalls.
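The train/test distinction described above can be sketched in a few lines. This is a minimal illustration of holding data out of training, with an invented toy dataset; it is not any lab's actual evaluation pipeline.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Hold out a fraction of examples that the model never sees in training."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Illustrative toy dataset of question-answer pairs.
data = [{"q": f"question {i}", "a": f"answer {i}"} for i in range(10)]
train, test = train_test_split(data)

# The held-out test items appear nowhere in the training set,
# so performance on them measures understanding, not recall.
assert not any(ex in train for ex in test)
```

The difficulty the article describes is precisely that this separation breaks down once a model's training data already contains most of what any test could ask.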

Experts warn that this challenge is becoming more urgent. Predictions suggest that by 2028, AI systems could effectively digest the entirety of humanity’s written output. With this horizon approaching, the question is no longer just about what AIs know but how to assess their abilities when their training encompasses nearly everything ever created.

Adding to the complexity is the phenomenon of “model collapse.” As AI-generated content increasingly populates the internet, future systems risk being trained on recycled material, potentially leading to declining performance. To counteract this, many developers are already turning to human interactions, using real-world exchanges to generate fresh training and testing data.

AI in the Real World

The future of testing AI might lie beyond the digital realm. Some researchers argue that intelligence can only be measured when machines experience the world as humans do. This vision of “embodied AI” is already being explored, with companies like Tesla using autonomous vehicles to collect real-world data. Other opportunities may come from human wearables, such as Meta’s smart glasses, equipped with cameras and microphones to record human-centric experiences.

These innovations could provide new avenues for AI learning, but they also underscore a deeper challenge—defining and measuring intelligence itself. For humans, intelligence encompasses diverse abilities, from problem-solving to emotional understanding. Traditional IQ tests have long faced criticism for their narrow scope, and similar limitations plague current AI benchmarks.

Narrow Measures, Broader Questions

Most existing tests for AI focus on specific tasks like text summarisation, visual recognition, or gesture interpretation. While these tests have been instrumental, their narrow focus limits their ability to measure broader intelligence. Take, for instance, the chess-playing AI Stockfish. It outperforms even Magnus Carlsen, the highest-rated human chess player in history, but its mastery of chess doesn’t extend to other forms of reasoning or understanding.

As AIs demonstrate more versatile capabilities, the need for better benchmarks grows. French Google engineer François Chollet introduced one such test: the Abstraction and Reasoning Corpus (ARC). Unlike traditional benchmarks, ARC challenges AIs to infer and apply abstract rules to solve puzzles. The test relies on minimal prior data, forcing the system to think flexibly rather than relying on pre-learned solutions.

ARC’s puzzles, presented as simple visual grids, are easy for humans but pose significant hurdles for AIs. Current models like OpenAI’s o1 and Anthropic’s Sonnet 3.5 score only 21% on the ARC leaderboard, far below the human average of over 90%. Even a recent attempt using GPT-4o, which employed a brute-force approach of generating thousands of possible answers, reached only 50%—still well short of the $600,000 prize for achieving 85%.
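The shape of an ARC-style task can be sketched as follows: infer a grid transformation from a few demonstration pairs, then apply it to a new input. The specific puzzle here (horizontal mirroring) is an invented illustration, far simpler than real ARC tasks.

```python
def mirror(grid):
    """Candidate rule: flip each row of the grid left-to-right."""
    return [row[::-1] for row in grid]

def fits(rule, demos):
    """Does the candidate rule explain every demonstration pair?"""
    return all(rule(inp) == out for inp, out in demos)

# Two demonstration pairs (input grid, output grid) of the hidden rule.
demos = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]

if fits(mirror, demos):
    print(mirror([[5, 0, 0]]))  # → [[0, 0, 5]]
```

What makes ARC hard for machines is that the rule must be induced fresh for each puzzle from only a handful of examples; the brute-force GPT-4o attempt mentioned above effectively generated many candidate rules and kept those consistent with the demonstrations.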

The Quest for Meaningful Tests

The ARC framework stands out as one of the most credible attempts to measure AI intelligence, but it’s not the only effort. The Scale/CAIS initiative aims to uncover novel ways to probe AI reasoning. Interestingly, the winning questions from Humanity’s Last Exam may remain unpublished to prevent AIs from accessing them during training—a clever safeguard against gaming the system.

The search for effective AI tests is not just academic; it’s a vital step toward understanding when machines approach human-level reasoning. This raises profound ethical and safety considerations. How do we prepare for a world where machines can think, reason, and act with the same—or greater—capabilities as humans?

The Next Frontier

Beyond assessing human-level AI, an even greater challenge looms on the horizon: testing for superintelligence. If machines surpass human cognition, how will we measure their understanding? The task is daunting, with implications for governance, safety, and control. As humanity crafts its questions for Humanity’s Last Exam, it also grapples with an existential query: what happens when the test taker outpaces the examiner?

In this unfolding story of AI, each breakthrough brings us closer to answers but also deeper into uncharted territory. The exam is not just a test of machines; it’s a reflection of humanity’s own readiness for the future it is creating.
