AI in ASIA

What is GDPval - and why it matters

While GPT-5 shows notable progress, matching or outperforming experts in over 40% of tasks, the benchmark has limitations, capturing only narrow, one-shot deliverables rather than full job complexity. We unpack the implications for business and human-AI collaboration.

Anonymous · 6 min read

AI Snapshot

The TL;DR: what matters, fast.

GDPval is a new OpenAI benchmark evaluating AI models against human professionals in 44 occupations across 9 industries.

GPT-5 achieves a "win or tie" rate of 40.6% on GDPval, indicating its ability to assist meaningfully in professional tasks.

Anthropic's Claude Opus 4.1 outperforms GPT-5 on GDPval with a win or tie rate of approximately 49%.

Who should pay attention: AI developers | Economists | Business leaders

What changes next: The capabilities of large language models will continue to be a key area of research.

A new benchmark suggests we’re inching toward AI doing parts of real jobs — but the full picture is still far from clear.

Could GPT‑5 already be doing the work of a software engineer, lawyer or nurse, or at least part of it? That is the provocative claim behind GDPval, a new evaluation by OpenAI that pits its models against human professionals across 44 occupations. The early results are striking, but they require nuance. This is not about AI replacing humans just yet; it's about measuring whether AI can already assist at a professional level.

GDPval is a new benchmark testing AI on real‑world deliverables (reports, blueprints, briefs) in 44 occupations across nine key industries. On this benchmark, GPT‑5 (high configuration) is rated “as good as or better than experts” about 40.6% of the time. Anthropic’s Claude Opus 4.1 outperforms GPT‑5 here, winning or tying ~49% of the time. But GDPval’s current form is one‑shot and narrow in scope: it doesn’t capture iterative workflows, ambiguity, stakeholder interaction or long projects. The key takeaway: GPT‑5 is moving into territory where it can assist meaningfully in professional tasks, but full substitution of human roles remains distant.

What Is GDPval — and Why It Matters

OpenAI describes GDPval as an evaluation of economically valuable, real‑world tasks drawn from roles in industries that contribute heavily to GDP. It differs from classic benchmarks (math puzzles, multiple choice, synthetic tests) by asking models to generate deliverables — documents, diagrams, slides, plans — based on realistic context and reference files.

The benchmark covers 44 occupations (software development, engineering, nursing, legal, financial analysis, among others) across nine sectors. For each task, human graders (experts in the same domain) blindly compare the AI output with a human expert’s version and rate whether the AI’s output is better, as good, or worse.
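The headline numbers from this protocol reduce to a simple tally over blinded pairwise judgments. A minimal sketch of that tally (the function name and grade labels are our own illustration, not OpenAI's actual code or schema):

```python
from collections import Counter

def win_or_tie_rate(grades):
    """Fraction of tasks where the AI deliverable was rated 'better'
    than or 'as good as' the human expert's version.

    `grades` is one blinded verdict per task, from the AI's
    perspective: 'better', 'as_good', or 'worse'.
    """
    counts = Counter(grades)
    wins_or_ties = counts["better"] + counts["as_good"]
    return wins_or_ties / len(grades)

# Toy example: 32 tasks graded blind by domain experts
grades = ["better"] * 5 + ["as_good"] * 8 + ["worse"] * 19
print(f"win-or-tie rate: {win_or_tie_rate(grades):.1%}")  # → 40.6%
```

The point of the sketch is how coarse the metric is: a "win or tie" collapses a rich expert comparison into one of three labels per task, which is part of why presentation effects (discussed below) can move the number.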

Its ambition: to shift evaluation of AI from isolated puzzles to work‑relevant performance. If a model can already pass parts of what professionals do, it changes how businesses adopt and trust AI.

What the Results Show (and Don’t)

Encouraging gains, but not dominance

GPT‑5’s “win or tie” rate — ~40.6% in its “high” mode — is a big leap over previous models. By comparison, GPT‑4o scored ~13.7%.

But even 40 % is not “AI wins most of the time.” In many tasks, it still trails human experts. The benchmark is more about getting close in selected domains than sweeping dominance.

⚖️ Claude outpaces GPT‑5 in this test

Claude Opus 4.1 achieves a ~49% win/tie rate on the same benchmark — a notable margin over GPT‑5. OpenAI suggests Claude may benefit from stylistic and formatting appeal (nicer graphics, layout) in the judging, not purely content superiority.

This underlines that presentation matters in these comparisons, not just factual correctness or reasoning depth.

Limitations are significant

- One‑shot format: each task is judged in a single pass, without room for revision, feedback loops or back-and-forth with stakeholders.
- Narrow scope of “job work”: many professional roles involve ambiguity, negotiation, collaboration, evolving constraints, meetings, client interaction — none of which are captured in GDPval‑v0.
- Judging bias: visual polish, formatting and readability might influence human graders. That can advantage models which produce clean layouts, not necessarily deeper insight.
- Not generalisable to every role: tasks are drawn from 44 occupations — many roles or tasks outside those are untested.

So while impressive, the results are suggestive, not conclusive.

Why This Matters (for Business, AI Adoption, Strategy)

1. Productivity uplift, not wholesale displacement

Even if GPT‑5 can’t replace a professional entirely, being able to reliably draft or assist with certain deliverables is immensely valuable. Professionals can offload parts of the workflow and focus on judgment, oversight, strategy and ethics. For more on this, check out our piece on the question every worker needs to answer: What is your non-machine premium?

OpenAI itself frames the narrative this way: as models improve, workers can “offload some of their work to the model and do higher‑value work.”

2. Deployment will be selective and domain-specific

Firms will first adopt AI for tasks that are well-defined, structured and lower risk: generating reports, summarising data, drafting first passes of legal memos. As models prove reliable, they’ll move into more complex areas.

3. Quality control and human oversight remain crucial

Even when AI output looks plausible, errors, hallucinations or misunderstood context can creep in. Especially in domains like law, medicine and engineering, an errant detail can be costly. Any deployment must include checks, correction workflows and human-in-the-loop review. This highlights the importance of knowing when AI slop needs a human polish.

4. Competitive landscape and hybrid strengths

Claude’s edge in GDPval suggests that different models may dominate different niches (style, presentation, domain depth). Organisations will want to choose or combine models based on their domain demands, not assume one “super‑model” wins everywhere. Our article Perplexity vs ChatGPT vs Gemini - five challenges, three contenders explores such comparisons.
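In practice, "choose or combine models by domain" can start as something as simple as a routing table keyed on task type. A toy sketch — the model names, task types and strength assignments here are placeholders chosen purely for illustration, and `pick_model` stands in for whatever dispatch layer an organisation actually builds:

```python
# Hypothetical routing table: which model handles which task type.
# Assignments reflect the article's observations (Claude's presentation
# edge, GPT-5's reasoning gains) but are illustrative, not a recommendation.
ROUTING_TABLE = {
    "legal_memo": "claude-opus-4.1",   # benefits from formatting/layout strength
    "data_analysis": "gpt-5-high",     # benefits from quantitative reasoning
}
DEFAULT_MODEL = "gpt-5-high"

def pick_model(task_type):
    """Return the model configured for this task type, else the default."""
    return ROUTING_TABLE.get(task_type, DEFAULT_MODEL)

print(pick_model("legal_memo"))     # → claude-opus-4.1
print(pick_model("press_release"))  # → gpt-5-high (fallback)
```

Real deployments would add human review and fallback logic on top, but the design choice is the same: route by demonstrated domain strength rather than standardising on one model.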

5. Benchmark evolution will redefine “capable AI”

GDPval itself is just version 0. OpenAI plans future iterations with interactive workflows, longer tasks and ambiguity. Over time, if a model can handle full project cycles, negotiation, error recovery and human collaboration, that’s when we cross a more meaningful threshold.

Evidence from Domain Benchmarks: Medical & Multimodal Gains

Beyond GDPval, GPT‑5 is also improving markedly in technical, domain-intensive evaluations:

- In medical reasoning and imaging, GPT‑5 outperforms GPT‑4o on radiology, treatment planning and visual question answering tasks. In a board‑style physics exam subset, GPT‑5 achieved ~90.7% accuracy vs ~78% for GPT‑4o, surpassing human pass thresholds.
- In ophthalmology QA, GPT‑5-high achieved ~96.5% accuracy on a specialist dataset, outperforming GPT‑4o and other variants.
- In domain integration tasks combining images and text, GPT‑5 shows gains over GPT‑4o on multi‑modal reasoning benchmarks.

These results confirm that GPT‑5’s improvements are not just superficial — they reflect deeper gains in reasoning, domain grounding and handling complex, integrated inputs.

What to Watch Next

- GDPval v1, v2 & interactive benchmarks: Will OpenAI (or others) introduce versions allowing models to revise, ask questions, iterate and collaborate? That will push closer to measuring real job performance.
- Real‑world case studies: Which firms begin embedding GPT‑5 in domain workflows? What savings, error rates and adoption challenges emerge?
- Error analysis & failure modes: Where do models still misstep — in ambiguity, domain nuance, edge cases, unexpected constraints — and how often?
- Regulation, liability and trust frameworks: As models shoulder parts of professional work, who is responsible for mistakes? Accountability, audit trails and transparency will become more urgent.
- Model specialization & hybrid stacks: We’ll likely see ensembles or hybrid systems: GPT‑5 plus domain-specific fine‑tuned models, or combining its generality with specialist tools (e.g. medical, legal). The “best” model may be a stack, not a standalone.

The GDPval benchmark is a milestone. GPT‑5’s showing — winning or tying on ~40% of professional tasks — signals we’re no longer in the realm of futuristic speculation: AI is already doing work that looks like what many professionals do. But we are not yet in an era where AI can stand in for professionals.

The transition now is from “AI as toy or research curiosity” to “AI as capable assistant.” The challenge over the next few years is whether AI can leap from assisting with fragments of work to reliably navigating full professional workflows. If it does, the implications for professional work will be profound.


Latest Comments (5)

Crystal (@crystalwrites) · 25 October 2025

This is really interesting, especially comparing GPT-5 with Claude Opus 4.1's performance! I've been using Claude for some content generation tasks lately, and it often feels more nuanced. Maybe this benchmark explains why!

Amelia Taylor (@ameliat) · 15 October 2025

honestly, this whole "40% of tasks" stat for GPT-5 gives me a chuckle. I still remember a client project last year where I was trying to get a fairly standard data viz brief generated. GPT-4, bless its circuits, kept insisting on putting pie charts everywhere. I explicitly said no pie charts. It just didn't get the nuance. So yeah, "delivers documents based on realistic context"... sure, if that context allows for a lot of manual correction on my end.

Benjamin Ng (@benng) · 10 October 2025

for our LLM-tutor, that "outperforming experts" metric is tricky. we see it in our own internal evals but like GDPval says, it's just one-off tasks not the whole student interaction.

Jasmine Koh (@jasminek) · 3 October 2025

The human-in-the-loop evaluation framework for GDPval is interesting, but I wonder if the "blind" rating sufficiently mitigates potential biases inherent in expert comparisons. How is inter-rater reliability measured?

Lakshmi Reddy (@lakshmi.r) · 1 October 2025

This "human graders" part for GDPval always makes me wonder. For Indic languages, finding reliable, expert human annotators in specific domains is already a huge challenge. If the benchmark expands globally, how will they ensure consistent, culturally informed grading, beyond just language proficiency?
