Model Evaluation for Extreme Risks
Dangerous-capability taxonomies like this one inform how safety institutes—including Australia's AISI—design pre-deployment evaluation criteria.
Key points
- A 2023 DeepMind-led paper proposes model evaluation frameworks targeting nine dangerous AI capability categories.
- The categories include cyber-offense, deception, persuasion and manipulation, weapons acquisition, and self-proliferation as extreme risk vectors.
- The source is primarily a research synthesis by the MIT AI Risk Repository; the underlying paper predates recent Australian AI safety evaluation work.
Summary
MIT's AI Risk Repository spotlights a 2023 paper by Shevlane, Farquhar, Garfinkel and co-authors proposing that model evaluation can address extreme AI risks by assessing both dangerous capabilities and model alignment. The framework identifies nine capability categories—including cyber-offense, deception, persuasion, weapons acquisition, and self-proliferation—through which general-purpose AI systems could cause catastrophic harm. The paper outlines how such evaluations could be embedded in safety and governance processes for training and deployment. This MIT blog post is a summary entry in a broader risk framework repository rather than new primary research.
Implications for Australian agencies
- [Monitor] Australian AISI and DISR policy staff may want to monitor how this dangerous-capabilities taxonomy is being adopted or adapted in peer-jurisdiction evaluation regimes.
- [Consider] Agencies developing AI risk assessment or procurement criteria could consider whether the nine capability categories provide a useful checklist for high-stakes AI acquisitions.
Implications are AI-generated. Starting points, not advice.
"Model Evaluation for Extreme Risks" Source: MIT AI Risk Repository – Blog Published: 6 February 2026 URL: https://airisk.mit.edu/blog/model-evaluation-for-extreme-risks MIT's AI Risk Repository spotlights a 2023 paper by Shevlane, Farquhar, Garfinkel and co-authors proposing that model evaluation can address extreme AI risks by assessing both dangerous capabilities and model alignment. The framework identifies nine capability categories—including cyber-offense, deception, persuasion, weapons acquisition, and self-proliferation—through which general-purpose AI systems could cause catastrophic harm. The paper outlines how such evaluations could be embedded in safety and governance processes for training and deployment. This MIT blog post is a summary entry in a broader risk framework repository rather than new primary research. Implications for Australian agencies: - [Monitor] Australian AISI and DISR policy staff may want to monitor how this dangerous-capabilities taxonomy is being adopted or adapted in peer-jurisdiction evaluation regimes. - [Consider] Agencies developing AI risk assessment or procurement criteria could consider whether the nine capability categories provide a useful checklist for high-stakes AI acquisitions. Retrieved from SIMS, 18 May 2026.