AI Can't Pass 'Humanity's Last Exam'

New 'Humanity's Last Exam' reveals AI's surprising limitations as even advanced models like Gemini 3 Pro struggle with graduate-level questions.

Intelligence Desk · 4 min read

AI Snapshot

The TL;DR: what matters, fast.

New 'Humanity's Last Exam' features 2,500 graduate-level questions across multiple domains

Leading AI model Gemini 3 Pro achieves only 38.3% accuracy on the challenging benchmark

Traditional AI benchmarks may not accurately reflect genuine machine intelligence capabilities

The Academic Test That Reveals AI's True Limitations

New research dubbed "Humanity's Last Exam" is challenging how we measure artificial intelligence, posing questions so complex that even today's most advanced AI models struggle to answer them. Published in Nature, this benchmark aims to pinpoint the current limitations of AI, moving beyond conventional tests that models often "cram" for.

The exam compiles 2,500 questions, each requiring graduate-level expertise across a broad spectrum of subjects, from ancient languages to advanced physics. Crucially, any question an AI could already answer correctly was excluded, ensuring the benchmark truly pushed the boundaries of machine capability.

Nearly a thousand international experts collaborated to craft these problems, highlighting the intricate knowledge gaps that still exist for AI. The results suggest that whilst AI continues to advance rapidly, there remain fundamental barriers to achieving human-level reasoning across diverse domains.

By The Numbers

  • Gemini 3 Pro Preview currently leads with 38.3% accuracy on the benchmark
  • GPT-5 follows at 25.3% accuracy, whilst Grok 4 achieves 24.5%
  • Mathematics accounts for 41% of the exam questions, with physics, biology, computer science, and other disciplines making up the remainder
  • 96% of Asia-Pacific organisations plan to increase AI investments by an average of 15% in 2026
  • 88% of APAC firms expect ROI of $2.85 per dollar invested in AI by 2026

Why Traditional Benchmarks Miss the Mark

AI developers often use benchmarks as a target, optimising their models to achieve high scores. While this might seem like progress, experts argue it often leads to AI becoming better at specific tests rather than demonstrating genuine, adaptable intelligence.

This phenomenon is evident in the "Humanity's Last Exam" results. Since its online publication in early 2025, AI scores have climbed steadily. However, these improvements don't automatically indicate that models are approaching human-level intelligence. Instead, they suggest AI systems are becoming more adept at the types of questions featured in this specific exam.

The issue mirrors broader concerns about how AI is reshaping the meaning of work, with the focus shifting to output rather than genuine cognitive understanding. When models optimise for benchmarks, they improve their ability to answer particular question formats without necessarily developing underlying comprehension or reasoning capabilities.

"Organizations in Southeast Asia are aligned with APAC's broader view on sovereign AI, with compliance, data security, and governance as top investment drivers," said Ambe Tierro, country managing director and technology lead at Accenture in the Philippines.

The Shift Towards Real-World Assessment

Recognising the limitations of benchmark-driven development, the industry is beginning to shift focus. OpenAI has introduced GDPval, a new metric designed to assess the real-world usefulness of AI. This measure evaluates AI performance on practical tasks like drafting project documents, conducting data analyses, and producing deliverables common in professional environments.

This distinction is vital for anyone considering or currently using AI tools. A model that excels at "Humanity's Last Exam" might still fall short on specific tasks relevant to your workflow. The exam itself has a strong academic leaning, creating a potential disconnect between theoretical performance and practical utility.

The gap becomes apparent when considering everyday applications. While AI struggles with graduate-level physics problems, it can effectively support professionals across Asia's workplaces through more targeted, domain-specific applications.

Assessment Type        | Focus Area             | Real-World Relevance
Traditional Benchmarks | Academic knowledge     | Limited practical application
GDPval Metrics         | Professional tasks     | Direct workplace utility
Domain-Specific Tests  | Industry requirements  | High relevance for specialists
User-Defined Criteria  | Personal workflows     | Maximum practical value

"A lot of countries are putting guardrails around AI and looking to pass legislation around the adoption of AI," said Nigel Lee, general manager for Singapore at Lenovo.

Asia's Practical AI Adoption Strategy

Across Asia-Pacific, organisations are taking a more pragmatic approach to AI evaluation and implementation. Rather than chasing benchmark scores, companies are focusing on practical applications that deliver measurable business value.

Key areas of focus include:

  • Sovereign AI development for data security and regulatory compliance
  • Hybrid infrastructure models that balance performance with governance requirements
  • Domain-specific applications tailored to local market needs
  • Integration with existing workflows rather than wholesale replacement
  • Emphasis on ROI measurement over theoretical capabilities

This approach reflects the reality that AI's impact varies significantly across different sectors and use cases. What works in one context may prove inadequate in another, regardless of benchmark performance.

The regional focus on AI companions and personal applications demonstrates how practical utility often trumps theoretical sophistication. Users gravitate towards AI tools that solve immediate problems rather than those that excel at abstract reasoning tasks.

Evaluating AI for Your Needs

Instead of relying solely on generalised benchmarks, a more pragmatic approach involves customising your evaluation. Define what you genuinely need AI to accomplish, then test different models against those specific criteria.

This method ensures you're selecting an AI tool based on its practical capabilities for your needs, rather than how it performs on abstract, high-level tests. Consider factors such as accuracy in your domain, integration capabilities, cost-effectiveness, and alignment with your organisation's governance requirements.
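To make this concrete, here is a minimal sketch of what a task-based evaluation harness might look like in Python. Everything in it is hypothetical: the tasks, the pass/fail checks, the model identifiers, and the query_model stub are placeholders you would replace with your own workflow tasks and your provider's API client.

```python
"""Minimal sketch of a task-based AI evaluation harness.

Hypothetical example: tasks, model names, and query_model are placeholders.
Swap in your own provider's API call and tasks drawn from your real workflow.
"""

from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str                   # the input you would send to the model
    check: Callable[[str], bool]  # returns True if the response is acceptable


# Tasks drawn from your actual workflow, not from academic benchmarks.
TASKS = [
    Task(
        prompt="Summarise this incident report in two sentences: ...",
        check=lambda r: "incident" in r.lower() and len(r) < 300,
    ),
    Task(
        prompt="Extract the invoice total from: 'Total due: SGD 1,240.00'",
        check=lambda r: "1,240" in r or "1240" in r,
    ),
]

MODELS = ["model-a", "model-b"]  # placeholder identifiers for the models under test


def query_model(model: str, prompt: str) -> str:
    """Placeholder: replace with a real API call to the model under test."""
    canned = {
        "model-a": "Incident summarised. Total due is SGD 1,240.00.",
        "model-b": "I am not sure.",
    }
    return canned[model]


def evaluate(model: str) -> float:
    """Return the fraction of workflow tasks the model handles acceptably."""
    passed = sum(1 for task in TASKS if task.check(query_model(model, task.prompt)))
    return passed / len(TASKS)


if __name__ == "__main__":
    for model in MODELS:
        print(f"{model}: {evaluate(model):.0%} of workflow tasks passed")
```

A harness along these lines lets you compare models on the criteria that actually matter to you, such as accuracy on your own tasks, latency, and cost per call, rather than on abstract exam scores.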

How should I evaluate AI models for my specific use case?

Start by clearly defining your requirements, then test models on representative tasks from your workflow. Focus on accuracy, speed, and integration capabilities rather than general benchmark scores.

Why do AI benchmark scores keep improving if the models aren't getting smarter?

Models optimise for specific test formats rather than developing genuine understanding. Higher scores often reflect better test-taking ability rather than improved reasoning capabilities.

What is sovereign AI and why does it matter in Asia?

Sovereign AI refers to locally controlled systems that comply with regional data governance requirements. It's increasingly important as Asian countries implement stricter AI regulations.

Should I wait for better AI models before adopting the technology?

Focus on current practical applications rather than waiting for theoretical improvements. Many AI tools already provide significant value for specific use cases today.

How can I avoid choosing AI based on misleading benchmarks?

Test models on your actual tasks and data. Prioritise real-world performance metrics over academic test scores that may not reflect practical utility.

The AIinASIA View: "Humanity's Last Exam" reveals an important truth: academic benchmarks alone cannot capture AI's practical value. We believe the industry's shift towards real-world assessment metrics like GDPval represents a maturation in how we evaluate AI systems. For organisations across Asia, this means focusing on domain-specific performance and regulatory compliance rather than chasing abstract benchmark scores. The future belongs to AI that solves actual problems, not theoretical puzzles.

While discussions about superintelligence often dominate headlines, the immediate focus should remain on developing AI that is genuinely relevant and useful in everyday life and professional settings. The most successful AI implementations will likely come from understanding specific needs rather than pursuing general intelligence metrics.

As the AI landscape continues to evolve, particularly with Asia's expanding focus on practical applications, the emphasis on real-world utility over benchmark performance becomes increasingly important. What matters most is not whether AI can solve graduate-level physics problems, but whether it can effectively support the tasks that matter to you and your organisation.

What are your thoughts on using tailored benchmarks for AI evaluation? Have you found traditional metrics helpful in choosing AI tools, or do you prefer testing models on your specific use cases? Drop your take in the comments below.
