Weekly Digest 9 Feb 2026

This week at a glance

This week's digest centres on a theme that runs across multiple items: the limits of current AI evaluation methods and what those limits mean for agencies making deployment decisions. Australia's participation in the International Network for Advanced AI Measurement, Evaluation, and Science makes the first item directly relevant: the Network has published preliminary consensus on automated evaluation practices at a moment when research from Oxford is challenging the reliability of standard benchmarks in real-world, high-stakes settings. Those findings should prompt reflection for any agency assessing AI tools for public-facing or sensitive functions. Practitioners with responsibilities in crisis communications or emergency management will also want to note the Alan Turing Institute's call for urgent action on AI-generated information threats during crisis events, even though only limited detail from the report is publicly available. Rounding out the week, the MIT AI Risk Repository has spotlighted two safety benchmarking frameworks that, while primarily academic in origin, offer reference points for teams building or reviewing AI risk taxonomies.

Headlines

Standards · International Network for Advanced AI Measurement, Evaluation, and Science Publishes Consensus Areas on Practices for Automated Evaluations
Risk · New study warns of risks in AI chatbots giving medical advice


Standards & Frameworks (1 item)

NIST – AI News (topic 2753736) (Multi) 13 Feb 2026

International Network for Advanced AI Measurement, Evaluation, and Science Publishes Consensus Areas on Practices for Automated Evaluations

The International Network for Advanced AI Measurement, Evaluation, and Science - a body founded by NIST's CAISI in November 2024 and comprising ten government members including Australia - has published consensus areas on practices and open questions for automated AI evaluation. The preliminary consensus emerged from a December 2025 workshop held alongside NeurIPS in San Diego, and builds on CAISI's draft Best Practices for Automated Benchmark Evaluations. The Network's outputs are intended to strengthen the scientific underpinnings of AI capability measurement and contribute to voluntary international standards. Ongoing discussions will continue at the India AI Impact Summit.

Key points

A ten-country network including Australia published consensus practices for automated AI evaluation and measurement.
Australia is a founding member of this NIST-led international body, giving APS bodies direct insight into emerging global evaluation norms.
Preliminary consensus draws on CAISI's draft Best Practices for Automated Benchmark Evaluations, currently open for public comment.

Implications

Monitor: AISI and DISR policy teams may want to monitor the Network's published consensus areas for alignment with Australia's domestic AI assurance and evaluation approaches.
Consider: Agencies developing AI evaluation or procurement frameworks could consider whether the Network's emerging practices warrant incorporation into internal guidance or assessment criteria.


Risk, Assurance & Ethics (4 items)

Oxford Internet Institute – News (UK) 9 Feb 2026

New study warns of risks in AI chatbots giving medical advice

A Nature Medicine study from the Oxford Internet Institute and University of Oxford, involving nearly 1,300 participants, found that LLMs provided no measurable improvement over traditional search engines or personal judgment for medical decision-making. Users struggled to provide the right inputs, received inconsistent answers, and could not distinguish good advice from poor advice within mixed responses. Critically, models that performed well on standardised benchmarks failed in real-user interactions, with researchers calling for clinical-trial-style testing of AI systems before public deployment. The findings reinforce concerns about the gap between AI evaluation methods and real-world performance in high-stakes domains.

Key points

A randomised trial of 1,298 participants found LLMs performed no better than search engines for medical decision-making.
LLM benchmark scores failed to predict real-world performance, raising questions about reliance on standardised evaluation methods.
UK-based research with no immediate Australian regulatory parallel, though findings are relevant to health AI risk assessment globally.

Implications

Consider: Agencies developing or procuring AI tools for citizen-facing or high-stakes internal use could consider whether current evaluation methods adequately capture real-user interaction risks, not just benchmark performance.
Monitor: Health and human services agencies may want to monitor emerging evidence on LLM reliability in sensitive domains as AI health tools become more prevalent in Australian contexts.


Alan Turing Institute – News (UK) 11 Feb 2026

New report calls for urgent action to tackle AI information threats following crisis events

The Alan Turing Institute has published a report calling for urgent UK action to address AI-driven information threats in the context of crisis events. The report appears to focus on how AI tools can amplify misinformation and disinformation during high-stakes, time-pressured situations such as natural disasters or public emergencies. The extracted text is limited, so the specific recommendations and their scope cannot be fully assessed, but the framing is relevant to any government managing public communications during crises.

Key points

Alan Turing Institute report warns the UK must act urgently on AI-driven information threats during crisis events.
Focus is on AI-amplified misinformation and disinformation risks in high-stress, time-sensitive contexts like disasters or emergencies.
Limited extracted text available; APS relevance depends on recommendations - worth monitoring rather than acting on.

Implications

Monitor: Agencies with crisis communications or emergency management responsibilities may want to monitor this report for transferable frameworks on AI-driven information threat response.


MIT AI Risk Repository – Blog (Global) 13 Feb 2026

SafetyBench: Evaluating the Safety of Large Language Models

The MIT AI Risk Repository's blog highlights SafetyBench, a 2023 bilingual (English/Chinese) benchmark for evaluating LLM safety across seven categories: offensiveness, bias, physical health, mental health, illegal activities, ethics and morality, and privacy. It uses 11,435 multiple-choice questions to assess over 25 models in zero-shot and few-shot settings. The blog entry is a brief summary of the underlying arXiv paper rather than new analysis, and is one of a series spotlighting frameworks catalogued in the Repository.

Key points

SafetyBench is a bilingual benchmark assessing LLM safety across 7 risk categories using 11,435 multiple-choice questions.
The MIT AI Risk Repository spotlights this as one of 28 frameworks cataloguing AI risks - useful for comparative evaluation work.
A 2023 academic paper; this blog post adds no new findings beyond summarising the original arXiv publication.

Implications

Monitor: Agencies developing AI procurement or evaluation criteria may want to monitor the MIT AI Risk Repository's framework catalogue as a reference collection for structured risk taxonomies.


MIT AI Risk Repository – Blog (Global) 9 Feb 2026

Safety Assessment of Chinese Large Language Models

The MIT AI Risk Repository's blog spotlights a 2023 paper by Sun et al. proposing a safety assessment framework for Chinese large language models. The framework comprises a taxonomy of 8 harmful content scenarios (including insult, discrimination, physical harm, and privacy exposure) and 6 adversarial instruction attack types (such as goal hijacking and role-play misuse), along with a benchmark and safety leaderboard assessing 15 LLMs. While developed for Chinese-language models, the authors note the taxonomy could scale to other languages and model families. The MIT blog entry is a summary rather than original research.

Key points

MIT AI Risk Repository spotlights a 2023 safety taxonomy for Chinese LLMs covering 8 harm scenarios and 6 adversarial attack types.
The taxonomy claims scalability beyond Chinese-language models, making it potentially relevant to broader LLM safety evaluation work.
This is a blog summary of a 2023 academic paper - useful reference material, not new guidance or policy.

Implications

Monitor: Agencies developing AI risk assessment frameworks or procurement criteria may want to note this taxonomy as one of several available reference structures for categorising LLM safety risks.
Consider: Policy teams could assess whether the 8-scenario harm taxonomy and adversarial attack categories map usefully onto Australia's responsible AI guidance or agency-level risk registers.


Implications are AI-generated. Starting points, not advice — see methodology for how they're framed.