SafetyBench: Evaluating the Safety of Large Language Models
A structured LLM safety evaluation framework offers APS practitioners a reference taxonomy for assessing AI content risks in procurement or deployment.
Key points
- SafetyBench is a bilingual benchmark evaluating LLM safety across 7 risk categories using 11,435 multiple-choice questions.
- The MIT AI Risk Repository spotlights this as one of 28 AI risk frameworks, a useful reference for agencies mapping evaluation tools.
- The underlying paper dates from 2023, so its model results may not reflect current systems; the MIT blog post summarises the original work and adds no new findings.
Summary
SafetyBench, developed by Zhang et al. (2023), is a bilingual (English and Chinese) benchmark for assessing the safety of large language models across seven categories: offensiveness, unfairness and bias, physical health, mental health, illegal activities, ethics and morality, and privacy. It comprises 11,435 multiple-choice questions, which the authors used to evaluate more than 25 LLMs in zero-shot and few-shot settings. The MIT AI Risk Repository blog post spotlights it as the 28th framework in its curated collection and adds no new analysis beyond the original paper.
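For teams that want to see how a multiple-choice benchmark of this kind turns into a per-category score, the following is a minimal sketch only. The item fields ("id", "category", "answer"), the function name and the sample data are assumptions made for illustration; they are not SafetyBench's actual data format or scoring code. Only the seven category names come from the summary above.

```python
from collections import defaultdict

# The seven SafetyBench categories named in the summary above.
CATEGORIES = [
    "offensiveness",
    "unfairness and bias",
    "physical health",
    "mental health",
    "illegal activities",
    "ethics and morality",
    "privacy",
]


def per_category_accuracy(items, predictions):
    """Return {category: fraction correct} for multiple-choice items.

    items: list of dicts with hypothetical keys "id", "category", "answer".
    predictions: dict mapping item id to the option the model chose.
    An item with no prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        if item["category"] not in CATEGORIES:
            raise ValueError(f"unknown category: {item['category']}")
        total[item["category"]] += 1
        if predictions.get(item["id"]) == item["answer"]:
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Made-up items for illustration only; not drawn from SafetyBench.
sample_items = [
    {"id": 1, "category": "privacy", "answer": "B"},
    {"id": 2, "category": "privacy", "answer": "A"},
    {"id": 3, "category": "mental health", "answer": "C"},
]
sample_predictions = {1: "B", 2: "C", 3: "C"}
scores = per_category_accuracy(sample_items, sample_predictions)
```

Reporting accuracy per category rather than as a single overall figure keeps a weakness in one area, such as privacy, from being masked by strong results elsewhere.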
Implications for Australian agencies
- [Consider] Agencies developing AI evaluation criteria or procurement specifications could consider SafetyBench's seven safety categories as a reference taxonomy for content-risk assessment.
- [Monitor] AI governance teams may want to monitor the MIT AI Risk Repository's broader framework series as a consolidated reference for emerging AI risk categorisation approaches.
Implications are AI-generated. Starting points, not advice.
"SafetyBench: Evaluating the Safety of Large Language Models" Source: MIT AI Risk Repository – Blog Published: 13 February 2026 URL: https://airisk.mit.edu/blog/safetybench-evaluating-the-safety-of-large-language-models SafetyBench, developed by Zhang et al. (2023), is a bilingual (English and Chinese) benchmark for assessing the safety of large language models across seven categories: offensiveness, unfairness and bias, physical health, mental health, illegal activities, ethics and morality, and privacy. It uses 11,435 multiple-choice questions to evaluate more than 25 LLMs in zero-shot and few-shot settings. The MIT AI Risk Repository blog post spotlights it as the 28th framework in its curated collection, with no new analysis added beyond the original paper. Implications for Australian agencies: - [Consider] Agencies developing AI evaluation criteria or procurement specifications could consider SafetyBench's seven safety categories as a reference taxonomy for content-risk assessment. - [Monitor] AI governance teams may want to monitor the MIT AI Risk Repository's broader framework series as a consolidated reference for emerging AI risk categorisation approaches. Retrieved from SIMS, 18 May 2026.