Challenge to the Last Exam of Humanity: ChatGPT, Gemini, Claude, and DeepSeek Reveal the Technological Future
    Artificial Intelligence (AI)

    Paloma Firgaira
    2026-03-05
    5 min read
ChatGPT, Gemini, Claude, and DeepSeek have been evaluated in Humanity’s Last Exam, a rigorous test published in Nature that measures their performance against human experts and reignites the debate about the proximity of artificial general intelligence (AGI). The results show significant progress, although the gap with human-level performance remains considerable.

The exam, developed by the Center for AI Safety and Scale AI, was introduced in January 2025 as a new standard for measuring the actual capabilities of large language models. Unlike other benchmarks, this test aims to determine whether systems like GPT-4o, Gemini, Claude, or DeepSeek can approach human specialized knowledge across a wide range of fields.

The study, published in Nature on January 28, describes an evaluation of 2,500 questions covering over 100 disciplines. More than 1,000 experts from 500 institutions in 50 countries participated in its development, under strict criteria: questions had to be precise, verifiable, and impossible to solve with a simple internet search.

The organizers of Humanity’s Last Exam discarded any questions that could be easily found online or that the models answered correctly in earlier phases. Out of about 70,000 initial proposals, only 13,000 passed the automatic filter by stumping the AI systems. After further review by specialists, the pool was reduced to 2,500 doctoral-level questions, ranging from Greek mythology to advanced physics problems about forces and motion in ideal systems.

At the launch of the test, OpenAI placed its model o1 in first place with only an 8.3% accuracy rate. Researchers anticipated that, at the current pace of advancement, models could exceed 50% by the end of 2025, a prediction that did not seem far-fetched. As of February 12, 2026, the highest score belongs to Gemini 3 Deep Think, with 48.4%. This figure contrasts with the performance of human experts, which hovers around 90% in their respective fields.
Thus, artificial intelligence stands at a midpoint: competent, but still far from expert mastery. The authors of the study caution about the limits of this metric: “A high accuracy in HLE would demonstrate expert-level performance on closed and verifiable questions and advanced scientific knowledge, but it does not alone imply autonomous research capabilities or general artificial intelligence,” they state in the article, suggesting that the arrival of AGI is still distant.

Source: elconfidencial.com
    Paloma Firgaira

    CEO

    With more than 20 years of experience, Paloma is a flexible, agile executive who excels at implementing strategies tailored to each situation. Her MBA in Business Administration and her background as an AI and Automation Expert strengthen her leadership and strategic thinking. Her efficiency in task planning and rapid adaptation to change contribute positively to her work. With strong leadership and interpersonal skills, she has a proven track record in financial management, strategic planning, and team development.