ChatGPT Now Competes with Humans in Professional Tasks: This is Demonstrated by OpenAI's New Benchmark
The question is no longer whether artificial intelligence can do your job, but when. OpenAI has taken a decisive step to answer this with the launch of GDPval, a benchmark that directly evaluates the ability of AI models to perform professional tasks that generate real economic value.
GDPval (named after Gross Domestic Product, GDP) does not just measure technical skill; it tests AI in the context of the real economy. The goal is clear: to determine whether current models can replace professionals in the sectors that contribute the majority of U.S. GDP.
The study covers 44 professions, from journalists and financial consultants to software engineers and nurses. The methodology is robust: human professionals generate real reports and deliverables, which are then compared to those produced by models like GPT-5 and Claude Opus. A panel of independent experts evaluates which work they prefer, without knowing whether it was done by a person or an AI.
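The blind pairwise protocol described above can be sketched in a few lines of Python. This is an illustrative assumption of how such grading might work, not OpenAI's actual evaluation harness; the names (`Comparison`, `blind_pair`, `win_rate`) and the tie-handling rule are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Comparison:
    """One task with a human deliverable and a model deliverable."""
    task_id: str
    human_deliverable: str
    model_deliverable: str

def blind_pair(c: Comparison):
    """Shuffle the two deliverables so the grader cannot tell which is which."""
    pair = [("human", c.human_deliverable), ("model", c.model_deliverable)]
    random.shuffle(pair)
    return pair

def win_rate(comparisons, grader):
    """Fraction of tasks where the grader prefers the model's work.

    `grader(a, b)` returns 0 if it prefers `a`, 1 if it prefers `b`,
    or "tie". Ties count as half a win (an illustrative convention).
    """
    wins = 0.0
    for c in comparisons:
        pair = blind_pair(c)
        choice = grader(pair[0][1], pair[1][1])
        if choice == "tie":
            wins += 0.5
        elif pair[choice][0] == "model":
            wins += 1.0
    return wins / len(comparisons)
```

Because the grader sees only the shuffled contents, any systematic bias toward "the AI's answer" or "the human's answer" is removed, which is the point of the blind design the article describes.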
The results are revealing. The most advanced models already rival qualified professionals in producing documents, analyses, and recommendations. For perspective, GPT-4o, released just 15 months earlier, was preferred by the expert graders only 13.7% of the time. In roughly a year and a half, AI has close to tripled its win rate against human professionals, closing in on parity.
This advancement is not just technical but structural. If the trend continues, many professions based on report generation, analysis, and summaries could be profoundly transformed. Knowledge work is increasingly becoming a flow of inputs and outputs that AI can replicate with growing efficiency.
However, GDPval has significant limitations. It only evaluates the generation of written deliverables, leaving out key aspects of human work such as strategic decision-making, interpersonal communication, team management, or adaptability to complex situations. Dr. Aaron Chatterji, chief economist at OpenAI, acknowledges that most professionals do much more than write reports. The challenge for the future will be to develop benchmarks that capture all that complexity.
Still, OpenAI argues that these advancements already allow for the freeing up of time for more valuable tasks: if AI can handle the documentation part, humans can focus on innovating and making decisions.
Until now, the most commonly used benchmarks in AI were academic in nature: solving mathematical problems, logic, text comprehension, etc. However, these tests have become less useful as references, as the most advanced models easily surpass them. GDPval represents a new generation of evaluations, much more aligned with the needs of businesses, governments, and professionals looking to anticipate the real impact of AI in their sectors.
In short, artificial intelligence is no longer just a promise for the future: it is starting to compete, head-to-head, with humans in tasks that until recently seemed exclusive to our species. The debate is no longer whether AI can do our jobs, but how we will adapt to coexist, and compete, with it.