    AI Can't Pass 'Humanity's Last Exam'

    Think AI's all-conquering? New research, "Humanity's Last Exam," reveals even advanced models stumble. Discover why they're failing this ultimate test.

    Anonymous
    4 min read · 3 February 2026

    AI Snapshot

    The TL;DR: what matters, fast.

    A new benchmark, "Humanity's Last Exam", challenges AI with 2,500 graduate-level questions designed to expose current limitations.

    The exam specifically excludes questions AI can already answer, aiming to test genuine, adaptable intelligence rather than memorisation.

    While AI scores on the exam have improved since early 2025, this indicates better performance on this specific test, not necessarily a fundamental leap towards human-level intelligence.

    Who should pay attention: AI researchers | Ethicists | Developers

    What changes next: Debate is likely to intensify regarding effective AI benchmarking.

    New research, dubbed "Humanity's Last Exam", is challenging how we measure artificial intelligence, posing questions so complex that even today's most advanced AI models struggle to answer them.

    Published in Nature, this benchmark aims to pinpoint the current limitations of AI, moving beyond conventional tests that models often "cram" for.

    The exam compiles 2,500 questions, each requiring graduate-level expertise across a broad spectrum of subjects, from ancient languages to advanced physics.

    Crucially, any question an AI could already answer correctly was excluded, ensuring the benchmark truly pushed the boundaries of machine capability.

    Nearly a thousand international experts collaborated to craft these problems, highlighting the intricate knowledge gaps that still exist for AI.

    The Problem with Traditional AI Benchmarks

    AI developers often use benchmarks as a target, optimising their models to achieve high scores. While this might seem like progress, experts argue it often leads to AI becoming better at specific tests rather than demonstrating genuine, adaptable intelligence. As an analogy, think of a student memorising answers for an exam versus truly understanding the subject matter. When AI models are solely optimised for benchmarks, they improve their ability to answer those particular questions, not necessarily their underlying comprehension or reasoning.

    This phenomenon is evident in the "Humanity's Last Exam" results. Since its online publication in early 2025, AI scores have climbed. Gemini 3 Pro Preview currently leads with 38.3% accuracy, followed by GPT-5 at 25.3% and Grok 4 at 24.5%. These figures suggest improvement, but they don't automatically mean the models are approaching human-level intelligence. Instead, they indicate the models are becoming more adept at the types of questions featured in this specific exam. We've seen similar issues with AI creating a new "meaning" of work, not just the outputs, where the focus shifts to measurable output rather than genuine cognitive tasks.

    Beyond the Scoreboard: Real-World Usefulness

    Recognising the limitations of benchmark-driven development, the industry is beginning to shift focus. OpenAI, for instance, has introduced GDPval, a new metric designed to assess the real-world usefulness of AI. This measure evaluates AI performance on practical tasks like drafting project documents, conducting data analyses, and producing deliverables that are common in professional environments. It's a move towards understanding whether business AI really gives back our time through tangible application rather than theoretical scores.

    This distinction is vital for anyone considering or currently using AI tools. A model that excels at "Humanity's Last Exam" might still fall short on the specific tasks relevant to your workflow. The exam itself has a strong academic leaning, with mathematics accounting for 41% of its questions, and physics, biology, and computer science making up a significant portion of the remainder. If your work involves communication, creative writing, or nuanced problem-solving, benchmark scores might offer little insight into a model's true utility. For instance, while AI can help you boost LinkedIn with 5 ChatGPT prompts, such tasks are far removed from deciphering ancient scripts.

    Practical Steps for Evaluating AI

    Instead of relying solely on generalised benchmarks, a more pragmatic approach involves customising your evaluation. Define what you genuinely need AI to accomplish, then test different models against those specific criteria. This method ensures you're selecting an AI tool based on its practical capabilities for your needs, rather than how it performs on an abstract, high-level test.
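
    To make this concrete, below is a minimal sketch of what such a tailored evaluation could look like in Python. It is illustrative only: ask_model is a hypothetical placeholder for whichever model API you actually use, and the simple keyword checks stand in for whatever acceptance criteria genuinely matter in your workflow.

# Minimal sketch of a tailored AI evaluation (illustrative, not a real benchmark).
# ask_model is a hypothetical stand-in: swap in a call to whichever model you are testing.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    prompt: str                # the job you actually need the model to do
    must_include: List[str]    # simple keyword checks standing in for your real acceptance criteria


def evaluate(ask_model: Callable[[str], str], tasks: List[Task]) -> float:
    """Run each task through the model and return the fraction that meet the criteria."""
    passed = 0
    for task in tasks:
        answer = ask_model(task.prompt).lower()
        if all(keyword.lower() in answer for keyword in task.must_include):
            passed += 1
    return passed / len(tasks)


if __name__ == "__main__":
    # Stubbed model call so the sketch runs as-is; replace with a real API call.
    def ask_model(prompt: str) -> str:
        return "Draft summary covering the budget, timeline and key risks."

    my_tasks = [
        Task("Summarise this project brief in three bullet points.", ["budget", "timeline"]),
        Task("Draft a polite follow-up email to a supplier about a missed deadline.", ["deadline"]),
    ]
    print(f"Pass rate: {evaluate(ask_model, my_tasks):.0%}")

    Even a rough harness like this tells you more about a model's fit for your own tasks than a headline benchmark score ever will.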

    As Professor Subbarao Kambhampati of Arizona State University, and former president of the Association for the Advancement of Artificial Intelligence, aptly puts it:

    "Humanity’s essence isn’t captured by a static test but rather by our ability to evolve and tackle previously unimaginable questions."

    This highlights that AI's true value lies in its ability to support and augment human problem-solving, not replace it with a simulated understanding.

    While discussions about superintelligence often dominate headlines, the immediate focus should remain on developing AI that is genuinely relevant and useful in everyday life and professional settings. Identifying AI images with 6 clues and free tools is a practical application, far more immediate than solving the exam's abstract theoretical problems.

    Related material can be found via the UK's Department for Science, Innovation and Technology.

    What are your thoughts on using tailored benchmarks for AI evaluation? Share your experiences and opinions in the comments below.
