Humanity's Last Exam
Humanity's Last Exam (HLE) is a language model benchmark consisting of 2,500 expert-level questions across a broad range of subjects. It was created jointly by the Center for AI Safety and Scale AI, and was designed to test reasoning abilities and human-like intelligence, as opposed to just pattern recognition.
History
Benchmark tests like Humanity's Last Exam have long been used to evaluate reasoning and learning capabilities in machines.[1] Early benchmarks, such as the Turing Test, measured whether machines could demonstrate human-like conversation abilities.[2] Other early benchmark tests evaluated computer vision, like MNIST for handwritten digit recognition and ImageNet for image classification.[3] The emergence of large language models (LLMs) in the 2020s led to the advancement and evolution of benchmark tests, with an emphasis on interpretability, reproducibility, and clearer evaluation criteria. Recent foundation model benchmarks, such as MMLU, HellaSwag, and the ARC Challenge, illustrate this shift.[4]
Creation
Humanity's Last Exam was created to keep pace with the rapid progress of LLMs and to provide a more rigorous assessment of these models. Leading models were already scoring around 90% on earlier benchmarks, creating the need for a more difficult exam.[5] Stanford HAI's AI Index 2025 Annual Report cites Humanity's Last Exam as one of the "more challenging benchmarks" developed in response to popular AI benchmarks having reached "saturation".[6] The test has been described as the brainchild of Dan Hendrycks, a machine learning researcher and the director of the Center for AI Safety, who stated that he was inspired to create the test after a conversation with Elon Musk, who thought the existing language model benchmarks, such as MMLU, were too easy. Hendrycks worked with Scale AI to compile the questions.[7] The questions were crowdsourced from subject-matter experts at various institutions across the world.[8][9] The questions were first filtered by leading AI models; if the models failed to answer a question, or did worse than random guessing on a multiple-choice question, it was reviewed by human experts for accuracy and wording in two rounds and then approved for inclusion in the dataset. The submitters of the top-rated questions were awarded prize money from a pool of 500,000 U.S. dollars: $5,000 for each of the top 50 questions and $500 for each of the next 500. After the initial release, a "community feedback bug bounty program" was opened to "identify and remove major errors in the dataset".[9] AI systems are able to surpass more focused, task-oriented tests, yet few perform well on broader, general-ability assessments.[10] HLE was designed to test reasoning abilities, which are considered a metric of "human" intelligence.[11]
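The pre-filtering rule described above can be summarized as a simple acceptance test. The following Python sketch is a hypothetical illustration only: the function names and data layout are assumptions, and the model querying and grading steps are stubbed out rather than taken from the actual HLE pipeline.

```python
# Hypothetical illustration of the pre-filtering rule described above: a
# submitted question advances to human expert review only if the frontier
# models fail it, or (for multiple choice) score worse than random guessing.
# Model querying and grading are stubbed out; names here are not from HLE.

def is_correct(model, question) -> bool:
    """Placeholder: query `model` on `question` and grade its answer."""
    raise NotImplementedError  # stand-in for a real model call plus grading

def advances_to_review(question, models, n_trials=5) -> bool:
    if question["type"] == "multiple_choice":
        chance = 1.0 / len(question["choices"])  # random-guessing baseline
        for model in models:
            hits = sum(is_correct(model, question) for _ in range(n_trials))
            if hits / n_trials >= chance:  # at or above chance: too easy
                return False
        return True
    # short-answer questions: advance only if every model answers incorrectly
    return not any(is_correct(model, question) for model in models)
```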
Composition
The benchmark consists of 2,500 questions in the publicly released set. The paper classifies the questions into the following broad subjects: mathematics (41%), physics (9%), biology/medicine (11%), humanities/social science (9%), computer science/artificial intelligence (10%), engineering (4%), chemistry (7%), and other (9%). Around 14% of the questions require the ability to understand both text and images, i.e., multi-modality. 24% of the questions are multiple-choice; the rest are short-answer, exact-match questions. A private set is also maintained to test for benchmark overfitting.[9]
An example question:[7]
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
An independent investigation by FutureHouse, published in July 2025, suggested that around 30% of the HLE answers for text-only chemistry and biology questions could be incorrect; the benchmark's team partially replicated the findings and said they hope to institute a continuous revisions process.[12] Questions are not available online, to protect the benchmark's fidelity and prevent the answers from becoming easily accessible via web search.[13] Responses are graded as correct only if they completely match the expert answer; everything else receives zero credit. Only the test question is allowed as a prompt; users cannot continue to guide the model with follow-up questions. Final scores are reported as percent accuracy.[14]
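A minimal sketch of this all-or-nothing scoring scheme, assuming simple string normalization for the exact match; the official harness may extract and compare answers differently (for example, with a model-based judge), so this is illustrative only.

```python
# Illustrative exact-match grading: 1 point for a response that matches the
# expert answer after light normalization, 0 otherwise; the final score is
# the mean over all questions, reported as percent accuracy.

def normalize(text: str) -> str:
    return " ".join(text.strip().lower().split())

def grade(response: str, expert_answer: str) -> int:
    return int(normalize(response) == normalize(expert_answer))

def percent_accuracy(responses, expert_answers) -> float:
    scores = [grade(r, a) for r, a in zip(responses, expert_answers)]
    return 100.0 * sum(scores) / len(scores)

# Example: a two-question "exam" with one exact match and one miss.
print(percent_accuracy(["4", "insertion"], ["4", "origin"]))  # 50.0
```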
Results
| Organization | Model | Accuracy (%) ↑ | Calibration Error (%) ↓ |
|---|---|---|---|
| Google DeepMind | Gemini 3 Pro Preview | 37.52 | 57 |
| OpenAI | GPT-5 Pro | 31.64 | 49 |
| Anthropic | Claude Sonnet 4.5 (Thinking) | 13.72 | 65 |
| Zhipu AI | GLM 4.5 | 8.32 | 79 |
| Meta AI | Llama 4 Maverick | 5.68 | 83 |
| Mistral AI | Mistral Medium 3 | 4.52 | 77 |
| Amazon Web Services | Nova Pro | 4.40 | 80 |
| Organization | Model | Accuracy (%) ↑ | Calibration Error (%) ↓ |
|---|---|---|---|
| OpenAI | gpt-oss-120b | 15.48 | 76 |
| Alibaba Cloud | Qwen3-235B-A22B-Thinking-2507 | 15.43 | 78 |
| DeepSeek | DeepSeek-R1-0528 | 14.04 | 78 |
| Moonshot AI | Kimi-K2-Instruct | 4.68 | 82 |
| Amazon Web Services | Nova Micro | 4.41 | 84 |
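The "Calibration Error" column reflects how closely a model's self-reported confidence tracks its actual accuracy on the benchmark. The sketch below computes a generic binned calibration error as an illustration; the exact metric and binning used by the HLE leaderboard may differ, and the example data are invented.

```python
import numpy as np

# Illustrative binned calibration error: group predictions by stated
# confidence, then compare each bin's mean confidence to its empirical
# accuracy, weighting by bin size. This is a sketch, not the HLE metric.

def calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)  # in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1 if graded correct
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    error, total = 0.0, len(confidences)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            error += (mask.sum() / total) * gap          # weight by bin size
    return 100.0 * error                                 # as a percentage

# Example: an overconfident model (high stated confidence, low accuracy)
# receives a large calibration error.
print(calibration_error([0.9, 0.9, 0.8, 0.95], [1, 0, 0, 0]))  # ~63.8
```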
Controversies
Several commentators have raised concerns regarding the development and use of AI benchmark tests. In an interview, Elon Musk expressed uncertainty about efforts to create AI systems that surpass human intelligence, but showed little concern about such an outcome, saying the achievement would probably be a good thing.[15] Additionally, the exam's organizers refrain from including questions about weapons due to AI safety concerns; this exclusion is one of the only safeguards placed on the exam.[8] The Oxford Internet Institute (OII) expressed skepticism about the benchmark's statistical rigor, arguing that the variance in models' results introduces an element of luck into claims of AI superiority. The OII also noted that key terms are left undefined, and that without a shared understanding of the concepts the exam is meant to measure, it is difficult to determine whether it accomplishes its intended purpose.[16]
References
- ^ "Humanity's Last Exam: The AI Benchmark for LLM Reasoning". IntuitionLabs. Retrieved 2025-11-20.
- ^ Pinar Saygin, Ayse; Cicekli, Ilyas; Akman, Varol (2000-11-01). "Turing Test: 50 Years Later". Minds and Machines. 10 (4): 463–518. doi:10.1023/A:1011288000451. ISSN 1572-8641.
- ^ Faber, Kamil; Zurek, Dominik; Pietron, Marcin; Japkowicz, Nathalie; Vergari, Antonio; Corizzo, Roberto (2024-10-01). "From MNIST to ImageNet and back: benchmarking continual curriculum learning". Machine Learning. 113 (10): 8137–8164. doi:10.1007/s10994-024-06524-z. ISSN 1573-0565.
- ^ Reuel, Anka (20 November 2024). "BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices". arXiv.
- ^ Phan, Long; et al. (2025). "Humanity's Last Exam". arXiv:2501.14249 [cs.LG].
- ^ Maslej, Nestor; et al. (April 2025). The AI Index 2025 Annual Report (PDF) (Report). Institute for Human-Centered AI. pp. 141–142.
- ^ a b Roose, Kevin (23 January 2025). "When A.I. Passes This Test, Look Out". New York Times. Archived from the original on 29 January 2025. Retrieved 24 January 2025.
- ^ a b Dastin, Jeffrey; Paul, Katie (16 September 2024). "AI experts ready 'Humanity's Last Exam' to stump powerful tech". Reuters. Archived from the original on 8 April 2025. Retrieved 24 January 2025.
- ^ a b c Phan, Long; et al. (2025). "Humanity's Last Exam". arXiv:2501.14249 [cs.LG].
- ^ Hernández-Orallo, José (2016). "Evaluation in artificial intelligence: From task-oriented to ability-oriented measurement". Artificial Intelligence Review: 1–51. doi:10.1007/s10462-016-9505-7. URL: https://riunet.upv.es/server/api/core/bitstreams/52884250-5f37-43f6-b966-014799bfac28/content
- ^ "Humanity's Last Exam: AI vs Human Benchmark Results | Galileo". Galileo AI. Retrieved 2025-11-20.
- ^ Skarlinski, Michael; Laurent, Jon; Bou, Albert; White, Andrew (16 September 2025). "About 30% of Humanity's Last Exam chemistry/biology answers are likely wrong". FutureHouse. Retrieved 15 October 2025.
- ^ "Humanity's Last Exam: AI vs Human Benchmark Results | Galileo". Galileo AI. Retrieved 2025-11-20.
- ^ "Humanity's Last Exam: The AI Benchmark for LLM Reasoning". IntuitionLabs. Retrieved 2025-11-20.
- ^ Béchard, Deni Ellis. "Elon Musk's New Grok 4 Takes on 'Humanity's Last Exam' as the AI Race Heats Up". Scientific American. Retrieved 2025-11-13.
- ^ "OII | Study identifies weaknesses in how AI systems are evaluated". www.oii.ox.ac.uk. Retrieved 2025-11-13.