AI Can't Pass 'Humanity's Last Exam'

New 'Humanity's Last Exam' reveals AI's surprising limitations as even advanced models like Gemini 3 Pro struggle with graduate-level questions.

Intelligence Desk · 4 min read

AI Snapshot

The TL;DR: what matters, fast.

New 'Humanity's Last Exam' features 2,500 graduate-level questions across multiple domains

Leading AI model Gemini 3 Pro achieves only 38.3% accuracy on the challenging benchmark

Traditional AI benchmarks may not accurately reflect genuine machine intelligence capabilities

The Academic Test That Reveals AI's True Limitations

New research dubbed "Humanity's Last Exam" is challenging how we measure artificial intelligence, posing questions so complex that even today's most advanced AI models struggle to answer them. Published in Nature, this benchmark aims to pinpoint the current limitations of AI, moving beyond conventional tests that models often "cram" for.

The exam compiles 2,500 questions, each requiring graduate-level expertise across a broad spectrum of subjects, from ancient languages to advanced physics. Crucially, any question an AI could already answer correctly was excluded, ensuring the benchmark truly pushed the boundaries of machine capability.

Nearly a thousand international experts collaborated to craft these problems, highlighting the intricate knowledge gaps that still exist for AI. The results suggest that whilst AI continues to advance rapidly, there remain fundamental barriers to achieving human-level reasoning across diverse domains.

By The Numbers

  • Gemini 3 Pro Preview currently leads with 38.3% accuracy on the benchmark
  • GPT-5 follows at 25.3% accuracy, whilst Grok 4 achieves 24.5%
  • Mathematics accounts for 41% of the exam questions, with physics, biology, computer science, and other disciplines making up the remainder
  • 96% of Asia-Pacific organisations plan to increase AI investments by an average of 15% in 2026
  • 88% of APAC firms expect ROI of $2.85 per dollar invested in AI by 2026

Why Traditional Benchmarks Miss the Mark

AI developers often use benchmarks as a target, optimising their models to achieve high scores. While this might seem like progress, experts argue it often leads to AI becoming better at specific tests rather than demonstrating genuine, adaptable intelligence.

This phenomenon is evident in the "Humanity's Last Exam" results. Since its online publication in early 2025, AI scores have climbed steadily. However, these improvements don't automatically indicate that models are approaching human-level intelligence. Instead, they suggest AI systems are becoming more adept at the types of questions featured in this specific exam.

The issue mirrors broader concerns about how AI is reshaping the meaning of work, with the focus shifting to output rather than genuine cognitive understanding. When models optimise for benchmarks, they improve their ability to answer particular question formats without necessarily developing underlying comprehension or reasoning capabilities.

"Organizations in Southeast Asia are aligned with APAC's broader view on sovereign AI, with compliance, data security, and governance as top investment drivers," said Ambe Tierro, country managing director and technology lead at Accenture in the Philippines.

The Shift Towards Real-World Assessment

Recognising the limitations of benchmark-driven development, the industry is beginning to shift focus. OpenAI has introduced GDPval, a new metric designed to assess the real-world usefulness of AI. This measure evaluates AI performance on practical tasks like drafting project documents, conducting data analyses, and producing deliverables common in professional environments.

This distinction is vital for anyone considering or currently using AI tools. A model that excels at "Humanity's Last Exam" might still fall short on specific tasks relevant to your workflow. The exam itself has a strong academic leaning, creating a potential disconnect between theoretical performance and practical utility.

The gap becomes apparent when considering everyday applications. While AI struggles with graduate-level physics problems, it can effectively support professionals across Asia's workplaces through more targeted, domain-specific applications.

Assessment Type        | Focus Area             | Real-World Relevance
Traditional Benchmarks | Academic knowledge     | Limited practical application
GDPval Metrics         | Professional tasks     | Direct workplace utility
Domain-Specific Tests  | Industry requirements  | High relevance for specialists
User-Defined Criteria  | Personal workflows     | Maximum practical value

"A lot of countries are putting guardrails around AI and looking to pass legislation around the adoption of AI," said Nigel Lee, general manager for Singapore at Lenovo.

Asia's Practical AI Adoption Strategy

Across Asia-Pacific, organisations are taking a more pragmatic approach to AI evaluation and implementation. Rather than chasing benchmark scores, companies are focusing on practical applications that deliver measurable business value.

Key areas of focus include:

  • Sovereign AI development for data security and regulatory compliance
  • Hybrid infrastructure models that balance performance with governance requirements
  • Domain-specific applications tailored to local market needs
  • Integration with existing workflows rather than wholesale replacement
  • Emphasis on ROI measurement over theoretical capabilities

This approach reflects the reality that AI's impact varies significantly across different sectors and use cases. What works in one context may prove inadequate in another, regardless of benchmark performance.

The regional focus on AI companions and personal applications demonstrates how practical utility often trumps theoretical sophistication. Users gravitate towards AI tools that solve immediate problems rather than those that excel at abstract reasoning tasks.

Evaluating AI for Your Needs

Instead of relying solely on generalised benchmarks, a more pragmatic approach involves customising your evaluation. Define what you genuinely need AI to accomplish, then test different models against those specific criteria.

This method ensures you're selecting an AI tool based on its practical capabilities for your needs, rather than how it performs on abstract, high-level tests. Consider factors such as accuracy in your domain, integration capabilities, cost-effectiveness, and alignment with your organisation's governance requirements.
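To make this concrete, here is a minimal sketch of what a task-based evaluation harness might look like in Python. Everything in it is hypothetical: the tasks, the pass/fail checks, the model identifiers, and the query_model stub are placeholders you would replace with your own workflow tasks and your provider's API client.

```python
"""Minimal sketch of a task-based AI evaluation harness.

Hypothetical example: tasks, model names, and query_model are placeholders.
Swap in your own provider's API call and tasks drawn from your real workflow.
"""

from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str                   # the input you would send to the model
    check: Callable[[str], bool]  # returns True if the response is acceptable


# Tasks drawn from your actual workflow, not from academic benchmarks.
TASKS = [
    Task(
        prompt="Summarise this incident report in two sentences: ...",
        check=lambda r: "incident" in r.lower() and len(r) < 300,
    ),
    Task(
        prompt="Extract the invoice total from: 'Total due: SGD 1,240.00'",
        check=lambda r: "1,240" in r or "1240" in r,
    ),
]

MODELS = ["model-a", "model-b"]  # placeholder identifiers for the models under test


def query_model(model: str, prompt: str) -> str:
    """Placeholder: replace with a real API call to the model under test."""
    canned = {
        "model-a": "Incident summarised. Total due is SGD 1,240.00.",
        "model-b": "I am not sure.",
    }
    return canned[model]


def evaluate(model: str) -> float:
    """Return the fraction of workflow tasks the model handles acceptably."""
    passed = sum(1 for task in TASKS if task.check(query_model(model, task.prompt)))
    return passed / len(TASKS)


if __name__ == "__main__":
    for model in MODELS:
        print(f"{model}: {evaluate(model):.0%} of workflow tasks passed")
```

A harness along these lines lets you compare models on the criteria that actually matter to you, such as accuracy on your own tasks, latency, and cost per call, rather than on abstract exam scores.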

How should I evaluate AI models for my specific use case?

Start by clearly defining your requirements, then test models on representative tasks from your workflow. Focus on accuracy, speed, and integration capabilities rather than general benchmark scores.

Why do AI benchmark scores keep improving if the models aren't getting smarter?

Models optimise for specific test formats rather than developing genuine understanding. Higher scores often reflect better test-taking ability rather than improved reasoning capabilities.

What is sovereign AI and why does it matter in Asia?

Sovereign AI refers to locally controlled systems that comply with regional data governance requirements. It's increasingly important as Asian countries implement stricter AI regulations.

Should I wait for better AI models before adopting the technology?

Focus on current practical applications rather than waiting for theoretical improvements. Many AI tools already provide significant value for specific use cases today.

How can I avoid choosing AI based on misleading benchmarks?

Test models on your actual tasks and data. Prioritise real-world performance metrics over academic test scores that may not reflect practical utility.

The AIinASIA View: "Humanity's Last Exam" reveals an important truth: academic benchmarks alone cannot capture AI's practical value. We believe the industry's shift towards real-world assessment metrics like GDPval represents a maturation in how we evaluate AI systems. For organisations across Asia, this means focusing on domain-specific performance and regulatory compliance rather than chasing abstract benchmark scores. The future belongs to AI that solves actual problems, not theoretical puzzles.

While discussions about superintelligence often dominate headlines, the immediate focus should remain on developing AI that is genuinely relevant and useful in everyday life and professional settings. The most successful AI implementations will likely come from understanding specific needs rather than pursuing general intelligence metrics.

As the AI landscape continues to evolve, particularly with Asia's expanding focus on practical applications, the emphasis on real-world utility over benchmark performance becomes increasingly important. What matters most is not whether AI can solve graduate-level physics problems, but whether it can effectively support the tasks that matter to you and your organisation.

What are your thoughts on using tailored benchmarks for AI evaluation? Have you found traditional metrics helpful in choosing AI tools, or do you prefer testing models on your specific use cases? Drop your take in the comments below.
