Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements
A structured taxonomy of LLM safety risks provides a reference baseline for APS agencies developing AI risk registers or evaluation criteria.
Key points
- A 2023 survey catalogues seven core safety risks in large language models, from toxicity to malicious use.
- Covers evaluation techniques and improvement strategies across data preparation, training, and deployment phases.
- A research synthesis rather than policy guidance - useful as background reading, not directly actionable for APS.
Summary
This MIT AI Risk Repository summary covers an academic survey that systematically catalogues safety risks in generative language models across seven categories: toxic content, discrimination, ethics and morality, controversial opinions, misleading information, privacy leakage, and malicious use. The paper also reviews safety evaluation methodologies - including adversarial testing and preference-based assessment - and improvement strategies spanning the model development lifecycle. While the paper itself is from 2023, the MIT repository's inclusion signals ongoing academic consensus-building around LLM risk classification. The taxonomy is broadly compatible with frameworks used in Australian AI governance contexts, including the DISR Responsible AI framework.
Implications for Australian agencies
- [Consider] Agencies developing AI risk registers or procurement evaluation criteria could use this taxonomy to cross-check coverage of LLM-specific safety categories.
- [Monitor] Policy teams tracking international AI safety research may want to monitor the MIT AI Risk Repository for further framework summaries that inform Australian standards development.
Implications are AI-generated. Starting points, not advice.
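As a rough illustration of the cross-check suggested above, the sketch below compares the seven categories named in the summary against the tags on a risk register. The register layout (a list of entries with "id", "title", and "categories" fields), the function name, and the sample entries are all hypothetical; a real register would need its own mapping onto the survey's categories.

```python
# The seven categories as listed in the summary of the survey.
SURVEY_CATEGORIES = [
    "toxic content",
    "discrimination",
    "ethics and morality",
    "controversial opinions",
    "misleading information",
    "privacy leakage",
    "malicious use",
]


def uncovered_categories(register_entries, categories=SURVEY_CATEGORIES):
    """Return the survey categories that no register entry is tagged with.

    register_entries is assumed to be an iterable of dicts, each with an
    optional "categories" list of free-text tags (a hypothetical layout).
    Matching is exact after trimming whitespace and lower-casing.
    """
    covered = set()
    for entry in register_entries:
        for tag in entry.get("categories", []):
            covered.add(tag.strip().lower())
    # Preserve the survey's ordering so gaps are easy to read.
    return [c for c in categories if c not in covered]


# Hypothetical register entries, for illustration only.
example_register = [
    {"id": "R-01", "title": "Chatbot discloses personal data",
     "categories": ["Privacy leakage"]},
    {"id": "R-02", "title": "Generated briefing contains false claims",
     "categories": ["misleading information"]},
    {"id": "R-03", "title": "Tool used to draft phishing emails",
     "categories": ["Malicious use "]},
]

for gap in uncovered_categories(example_register):
    print("No register entry covers:", gap)
```

Exact tag matching is deliberately strict here. A gap in the output may simply mean a register uses different wording for the same risk, so the result is a prompt for review rather than a finding.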
"Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements"
Source: MIT AI Risk Repository – Blog
Published: 18 September 2024
URL: https://airisk.mit.edu/blog/towards-safer-generative-language-models-a-survey-on-safety-risks-evaluations-and-improvements
Retrieved from SIMS, 18 May 2026.