Submit Your Toughest Questions for Humanity's Last Exam
Benchmark saturation matters for AI governance because it complicates capability assessments that underpin risk frameworks and procurement decisions.
Key points
- CAIS and Scale AI are crowdsourcing expert-level questions to build a harder AI capability benchmark.
- Existing benchmarks like MMLU are saturated; frontier models now score near ceiling on popular tests.
- This is a research participation call, not policy guidance, so it has low direct relevance for APS practitioners.
Summary
The Centre for AI Safety and Scale AI have launched 'Humanity's Last Exam', an initiative to build a harder public AI benchmark by crowdsourcing expert-level questions across all fields. The project is motivated by the rapid saturation of existing benchmarks like MMLU: frontier models now approach ceiling performance on them, which makes it difficult to assess how close AI systems are to expert-level capability. Contributors whose questions are accepted are offered co-authorship and a share of a $500,000 prize pool. The submission deadline was 1 November 2024.
Implications for Australian agencies
- [Monitor] Agencies involved in AI capability assessment or procurement evaluation may want to track whether Humanity's Last Exam becomes a reference benchmark in vendor or safety evaluation contexts.
Implications are AI-generated. Starting points, not advice.
"Submit Your Toughest Questions for Humanity's Last Exam" Source: Centre for AI Safety – Blog Published: (undated) URL: https://safe.ai/blog/humanitys-last-exam The Centre for AI Safety and Scale AI have launched 'Humanity's Last Exam', an initiative to build a harder public AI benchmark by crowdsourcing expert-level questions across all fields. The project is motivated by the rapid saturation of existing benchmarks like MMLU, which frontier models now approach ceiling performance on, making it difficult to assess how close AI systems are to expert-level capability. Contributors whose questions are accepted are offered co-authorship and a share of a $500,000 prize pool. The submission deadline was 1 November 2024. Implications for Australian agencies: - [Monitor] Agencies involved in AI capability assessment or procurement evaluation may want to monitor whether Humanity's Last Exam becomes a reference benchmark in vendor or safety evaluation contexts. Retrieved from SIMS, 18 May 2026.