A new benchmark suggests we’re inching toward AI doing parts of real jobs — but the full picture is still far from clear.
Could GPT‑5 already be doing the work of a software engineer, lawyer or nurse, or at least part of it? That is the provocative question behind GDPval, a new evaluation by OpenAI that pits its models against human professionals across 44 occupations. The early results are striking, but they require nuance. This is not about AI replacing humans just yet; it is about measuring whether AI can already assist at a professional level.
GDPval is a new benchmark testing AI on real‑world deliverables (reports, blueprints, briefs) in 44 occupations across nine key industries. On this benchmark, GPT‑5 (high configuration) is rated “as good as or better than experts” about 40.6% of the time. Anthropic’s Claude Opus 4.1 outperforms GPT‑5 here, winning or tying roughly 49% of the time. But GDPval in its current form is one‑shot and narrow in scope: it doesn’t capture iterative workflows, ambiguity, stakeholder interaction or long projects. The key takeaway: GPT‑5 is moving into territory where it can assist meaningfully with professional tasks, but full substitution of human roles remains distant.
What Is GDPval — and Why It Matters
OpenAI describes GDPval as an evaluation of economically valuable, real‑world tasks drawn from roles in industries that contribute heavily to GDP. It differs from classic benchmarks (math puzzles, multiple choice, synthetic tests) by asking models to generate deliverables — documents, diagrams, slides, plans — based on realistic context and reference files.
The benchmark covers 44 occupations (software development, engineering, nursing, legal, financial analysis, among others) across nine sectors. For each task, human graders (experts in the same domain) compare the AI output with a human expert’s version in a blind review and rate whether the AI’s output is better, as good, or worse.
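To make the scoring concrete, here is a minimal sketch of how a “win or tie” rate like the one reported for GPT‑5 could be computed from blinded grader verdicts. The verdict labels and helper function are illustrative assumptions, not OpenAI’s actual grading code.

```python
from collections import Counter

# Hypothetical grader verdicts: for each task, a blinded domain expert marks
# the AI deliverable as "better", "as_good" or "worse" than the human version.
verdicts = ["better", "worse", "as_good", "worse", "better", "as_good", "worse"]

def win_or_tie_rate(verdicts: list[str]) -> float:
    """Share of tasks where the AI output was rated as good as or better
    than the human expert's deliverable."""
    counts = Counter(verdicts)
    return (counts["better"] + counts["as_good"]) / len(verdicts)

print(f"win-or-tie rate: {win_or_tie_rate(verdicts):.1%}")
```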
Its ambition: to shift evaluation of AI from isolated puzzles to work‑relevant performance. If a model can already pass parts of what professionals do, it changes how businesses adopt and trust AI.
What the Results Show (and Don’t)
Encouraging gains, but not dominance
GPT‑5’s “win or tie” rate — ~40.6% in its “high” mode — is a big leap over previous models. By comparison, GPT‑4o scored ~13.7%.
But even 40 % is not “AI wins most of the time.” In many tasks, it still trails human experts. The benchmark is more about getting close in selected domains than sweeping dominance.
⚖️ Claude outpaces GPT‑5 in this test
Claude Opus 4.1 achieves a ~49% win/tie rate on the same benchmark — a notable margin over GPT‑5. OpenAI suggests Claude may benefit from stylistic and formatting appeal (nicer graphics, layout) in the judging, not purely content superiority.
This underlines that presentation matters in these comparisons, not just factual correctness or reasoning depth.
Limitations are significant
One‑shot format: each task is judged in a single pass, without room for revision, feedback loops or back-and-forth with stakeholders.
Narrow scope of “job work”: many professional roles involve ambiguity, negotiation, collaboration, evolving constraints, meetings, client interaction — none of which are captured in GDPval‑v0.
Judging bias: visual polish, formatting and readability might influence human graders. That can advantage models which produce clean layouts, not necessarily deeper insight.
Not generalisable to every role: tasks are drawn from 44 occupations — many roles and tasks outside those are untested.
So while impressive, the results are suggestive, not conclusive.
Why This Matters (for Business, AI Adoption, Strategy)
1. Productivity uplift, not wholesale displacement
Even if GPT‑5 can’t replace a professional entirely, being able to reliably draft or assist with certain deliverables is immensely valuable. Professionals can offload parts of the workflow and focus on judgment, oversight, strategy, ethics. For more on this, check out our piece on what every worker needs to answer: What is your non-machine premium?
OpenAI itself frames the narrative this way: as models improve, workers can “offload some of their work to the model and do higher‑value work.”
2. Deployment will be selective and domain-specific
Firms will first adopt AI for tasks that are well-defined, structured and lower risk: generating reports, summarising data, drafting first passes of legal memos. As models prove reliable, they’ll move into more complex areas.
3. Quality control and human oversight remain crucial
Even when AI output looks plausible, errors, hallucinations or context misunderstandings can creep in. Especially in domains like law, medicine and engineering, an errant detail can be costly. Any deployment must include checks, correction workflows and human-in-the-loop review. This highlights the importance of knowing when AI slop needs a human polish.
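As a rough illustration of what such a human-in-the-loop gate might look like in practice, here is a minimal Python sketch. The generate_draft and human_review functions are hypothetical placeholders, not a real model API.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    task_id: str
    text: str
    approved: bool = False
    reviewer_notes: str = ""

def generate_draft(task_id: str, brief: str) -> Draft:
    # Placeholder for a model call (e.g. a request to GPT-5 or Claude);
    # in a real pipeline this would call the provider's API.
    return Draft(task_id=task_id, text=f"[model draft for: {brief}]")

def human_review(draft: Draft, reviewer: str) -> Draft:
    # A domain expert edits and signs off here; nothing is sent to a
    # client or filed until approved is True.
    draft.reviewer_notes = f"checked by {reviewer}"
    draft.approved = True
    return draft

draft = generate_draft("memo-042", "first pass of a legal memo on data retention")
final = human_review(draft, reviewer="senior associate")
assert final.approved, "unreviewed drafts must never leave the pipeline"
```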
4. Competitive landscape and hybrid strengths
Claude’s edge in GDPval suggests that different models may dominate different niches (style, presentation, domain depth). Organisations will want to choose or combine models based on their domain demands rather than assume one “super‑model” wins everywhere. Our article Perplexity vs ChatGPT vs Gemini - five challenges, three contenders explores such comparisons.
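One way such a mixed stack could be wired is a simple router that maps task types to models. The sketch below is purely illustrative; the model names and route table are assumptions, not recommendations.

```python
# Toy router that assigns tasks to models by task profile. The model names
# and route table are illustrative assumptions, not a verdict on which
# model is best for which domain.
ROUTES = {
    "client_presentation": "claude-opus-4.1",     # presentation/formatting-heavy work
    "code_review": "gpt-5-high",                  # reasoning-heavy work
    "radiology_report": "medical-specialist-llm", # hypothetical fine-tuned specialist
}

def pick_model(task_type: str, default: str = "gpt-5-high") -> str:
    """Return the model configured for this task type, else a general default."""
    return ROUTES.get(task_type, default)

for task in ("client_presentation", "contract_summary", "radiology_report"):
    print(task, "->", pick_model(task))
```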
5. Benchmark evolution will redefine “capable AI”
GDPval itself is just version 0. OpenAI plans future iterations with interactive workflows, longer tasks and ambiguity. Over time, if a model can handle full project cycles, negotiation, error recovery and human collaboration, that’s when we cross a more meaningful threshold.
Evidence from Domain Benchmarks: Medical & Multimodal Gains
Beyond GDPval, GPT‑5 is also showing marked improvement in technical, domain-intensive evaluations:
In medical reasoning and imaging, GPT‑5 outperforms GPT‑4o in radiology, treatment planning and visual question answering tasks. In a board‑style physics exam subset, GPT‑5 achieved ~90.7% accuracy vs ~78% for GPT‑4o, surpassing human pass thresholds.
In ophthalmology QA, GPT‑5-high achieved ~96.5% accuracy on a specialist dataset, outperforming GPT‑4o and other variants.
In domain integration tasks combining images and text, GPT‑5 shows gains over GPT‑4o in multi‑modal reasoning benchmarks.
These results confirm that GPT‑5’s improvements are not just superficial — they reflect deeper gains in reasoning, domain grounding and handling complex, integrated inputs.
What to Watch Next
GDPval v1, v2 and interactive benchmarks: Will OpenAI (or others) introduce versions allowing models to revise, ask questions, iterate and collaborate? That will push closer to measuring real job performance.
Real‑world case studies: Which firms begin embedding GPT‑5 in domain workflows? What savings, error rates and adoption challenges emerge?
Error analysis and failure modes: Where do models still misstep? In ambiguity, domain nuance, edge cases, unexpected constraints — and how often?
Regulation, liability and trust frameworks: As models shoulder parts of professional work, who is responsible for mistakes? Accountability, audit trails and transparency will become more urgent.
Model specialization and hybrid stacks: We’ll likely see ensembles or hybrid systems: GPT‑5 plus domain-specific fine‑tuned models, or combining its generality with specialist tools (e.g. medical, legal). The “best” model may be a stack, not a standalone.
The GDPval benchmark is a milestone. GPT‑5’s finish — winning or tying ~40% of professional tasks — signals we’re no longer in the realm of futuristic speculation: AI is already doing work that looks like what many professionals do. But we are not yet in the era where AI is the professional.
The transition now is from “AI as toy or research curiosity” to “AI as capable assistant.” The challenge over the next few years is whether AI can leap from assisting with fragments of work to reliably navigating full professional workflows. If it does, the implications for how professional work is organised will be profound.