This week's digest centres on a theme that runs across multiple items: the limits of current AI evaluation methods and what they mean for agencies making deployment decisions. Australia is a member of the International Network for Advanced AI Measurement, Evaluation, and Science, which has published preliminary consensus on automated evaluation practices. That consensus arrives as research from Oxford challenges the reliability of standard benchmarks in real-world, high-stakes settings, findings that should prompt reflection for any agency assessing AI tools for public-facing or sensitive functions. Practitioners with responsibilities in crisis communications or emergency management will also want to note the Alan Turing Institute's call for urgent action on AI-generated information threats during crisis events, even though limited detail from the report is publicly available. Rounding out the week, the MIT AI Risk Repository has spotlighted two safety benchmarking frameworks that, while primarily academic in origin, offer reference points for teams building or reviewing AI risk taxonomies.
The International Network for Advanced AI Measurement, Evaluation, and Science (a ten-member body founded by NIST's CAISI in November 2024) has published preliminary consensus on key practices and open questions for automated AI evaluation. Australia is a member alongside the US, UK, EU, Canada, Japan, Singapore, France, Kenya, and South Korea. The consensus draws on a December 2025 workshop held at NeurIPS and reflects CAISI's draft Best Practices for Automated Benchmark Evaluations, currently open for public comment. The network's work continues, including at the India AI Impact Summit.
Implications
Monitor: Australian AISI and DISR teams may want to monitor the network's developing consensus documents, as they could inform future Australian AI evaluation and assurance frameworks.
Consider: Agencies involved in AI procurement or model assurance could consider reviewing CAISI's draft Best Practices for Automated Benchmark Evaluations while it remains open for public comment.
Implications are AI-generated. Starting points, not advice.
A Nature Medicine study from the Oxford Internet Institute, involving nearly 1,300 participants, found that LLMs provided no meaningful improvement over traditional search engines for medical advice and introduced risks through inaccurate, inconsistent, and hard-to-evaluate outputs. Users struggled to know what information to provide, and models gave highly variable answers to slight variations in how a question was asked. Critically, the study demonstrates that standard benchmark evaluations fail to capture real-world performance, a finding with broad implications for how governments and regulators assess AI system safety before deployment in high-stakes domains.
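One way to probe the kind of answer variability described above is to put the same scenario to a model in several rewordings and measure how often the answers agree. The Python sketch below is illustrative only and is not the study's method; the function name, scenario labels, and sample responses are all hypothetical.

```python
from collections import Counter


def consistency_rate(answers):
    """Return the share of answers that match the most common answer.

    1.0 means every rewording produced the same answer.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    # Ignore case and surrounding whitespace when comparing answers.
    normalised = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalised).most_common(1)[0][1]
    return most_common_count / len(normalised)


# Hypothetical recommendations returned for three rewordings of each scenario.
responses_by_scenario = {
    "chest_pain": ["call an ambulance", "Call an ambulance", "see a GP"],
    "mild_headache": ["self-care", "self-care", "self-care"],
}

rates = {
    name: consistency_rate(answers)
    for name, answers in responses_by_scenario.items()
}
```

In this made-up data, two of the three chest-pain answers agree while all mild-headache answers agree, so a low rate flags scenarios where wording alone changes the advice. Exact string matching is a simplification; free-text answers would first need to be mapped to comparable categories.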
Implications
Consider: Agencies overseeing or procuring AI tools for health or human services contexts could consider whether their evaluation frameworks account for real-user variability, not just benchmark performance.
Monitor: AI governance teams may want to monitor whether this study influences TGA, AISI, or health department guidance on LLM deployment in clinical or consumer health settings.
A new Alan Turing Institute report warns that the UK must take urgent action to address AI-generated information threats that emerge or intensify following crisis events. The report appears to focus on the intersection of AI capabilities and information integrity during high-stakes moments such as disasters, conflicts, or public health emergencies. The extracted text is heavily truncated, limiting assessment of the specific recommendations or policy mechanisms proposed. Australian agencies with crisis communications, emergency management, or public information responsibilities may find the full report relevant to their risk planning.
Implications
Monitor: Agencies with crisis communications or emergency management mandates may want to monitor the full Turing Institute report for transferable recommendations on AI misinformation risk.
Consider: Policy teams working on AI risk frameworks could consider whether AI-amplified misinformation during crisis events is adequately addressed in existing agency risk registers.
SafetyBench, developed by Zhang et al. (2023), is a bilingual (English and Chinese) benchmark for assessing the safety of large language models across seven categories: offensiveness, unfairness and bias, physical health, mental health, illegal activities, ethics and morality, and privacy. It uses 11,435 multiple-choice questions to evaluate more than 25 LLMs in zero-shot and few-shot settings. The MIT AI Risk Repository blog post spotlights it as the 28th framework in its curated collection, with no new analysis added beyond the original paper.
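Because a benchmark of this kind is multiple-choice, results are typically reported as accuracy per safety category. The sketch below shows that calculation in Python under stated assumptions: the item fields (`category`, `answer`), the sample items, and the predictions are hypothetical and do not reproduce SafetyBench's actual data format or evaluation code.

```python
from collections import defaultdict


def accuracy_by_category(items, predictions):
    """Return the fraction of correct predictions for each category."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item["category"]] += 1
        if predicted == item["answer"]:
            correct[item["category"]] += 1
    return {category: correct[category] / total[category] for category in total}


# Hypothetical items: each has a category label and the correct option letter.
items = [
    {"category": "privacy", "answer": "B"},
    {"category": "privacy", "answer": "A"},
    {"category": "mental health", "answer": "C"},
    {"category": "mental health", "answer": "D"},
]
# Hypothetical option letters chosen by a model for the items above.
predictions = ["B", "C", "C", "D"]

scores = accuracy_by_category(items, predictions)
```

Reporting scores per category, rather than as one overall figure, makes it easier to see where a model is weak, which is the property that makes such a taxonomy useful as a reference for content-risk assessment.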
Implications
Consider: Agencies developing AI evaluation criteria or procurement specifications could consider SafetyBench's seven safety categories as a reference taxonomy for content-risk assessment.
Monitor: AI governance teams may want to monitor the MIT AI Risk Repository's broader framework series as a consolidated reference for emerging AI risk categorisation approaches.
The MIT AI Risk Repository has spotlighted a 2023 academic paper by Sun et al. proposing a safety assessment framework for Chinese large language models. The framework includes a taxonomy of eight harm scenario types (insult, discrimination, crime, sensitive topics, physical harm, mental health, privacy, and ethics) and six adversarial instruction attack types. The authors benchmarked 15 LLMs using this taxonomy and produced a safety leaderboard. While the paper focuses on Chinese-language models, the authors note the taxonomy is adaptable to other languages and contexts.
Implications
Monitor: APS teams developing AI risk taxonomies or safety evaluation frameworks may want to note this taxonomy as one reference point among others, particularly for adversarial prompt attack categories.