Weekly Digest 23 Mar 2026

This week at a glance

This week's digest centres on AI assurance and evaluation practice, with several items directly relevant to APS agencies building or procuring AI capabilities. A practitioner guide from Australian quality engineering firm KJR sets out a structured approach to LLM testing for regulated sectors, including government, with particular attention to accountability obligations that cannot be passed to model vendors. On the evaluation standards front, NIST's CAISI has partnered with OpenMined to develop privacy-preserving AI assessment methods for settings where data or models cannot be openly shared — a practical constraint familiar to many federal agencies. Rounding out the week, UK AI Security Institute findings on frontier models completing multi-step cyberattacks in controlled exercises add weight to the case for robust pre-deployment security testing, while an OECD commentary questions whether current stakeholder engagement practices in AI governance are substantive enough to meet trustworthiness requirements across a system's full lifecycle.

Headlines

AU Gov · What Is LLM Testing? A Complete Guide for Enterprises
Standards · To be truly participative, stakeholder involvement should follow an AI system’s entire lifecycle
Tech · Import AI 450: China's electronic warfare model; traumatized LLMs; and a scaling law for cyberattacks

Australian Government (2 items)

KJR – Insights (AU) · 24 Mar 2026

What Is LLM Testing? A Complete Guide for Enterprises

KJR, an Australian testing and assurance consultancy, has published a guide to LLM testing aimed at enterprise QA and testing leaders. The guide explains why traditional software testing approaches are insufficient for probabilistic LLM systems and proposes a four-phase enterprise framework spanning risk identification, test strategy design, execution, and governance reporting. Core testing domains include functional validation, security and adversarial testing, bias assessment, and ongoing drift detection. The guide names Australian government, alongside financial services, healthcare, and utilities, as a sector where LLM testing is a governance requirement rather than an optional extra. The piece references a Microsoft/KJR case study involving an Azure OpenAI-powered RAG platform for the public sector.

Key points

KJR outlines a structured enterprise framework for testing and assuring LLM-powered systems across regulated sectors.
Australian government agencies are explicitly named as a regulated sector where LLM testing is a governance requirement.
The item is vendor-authored marketing content from a testing consultancy: practical, but commercial in framing.

Implications

Consider: Agencies developing or procuring LLM-based tools may want to check whether their AI assurance processes address the testing domains outlined here, particularly adversarial testing and drift detection.
Monitor: AI governance and risk teams may want to track emerging LLM testing frameworks and vendor methodologies as practical input to assurance requirements in procurement and delivery.

NIST – AI News (US) · 27 Mar 2026

Announcement: CAISI signs CRADA with OpenMined to Enable Secure AI Evaluations

NIST's Center for AI Standards and Innovation (CAISI) has signed a Cooperative Research and Development Agreement (CRADA) with OpenMined, a non-profit developing open-source secure computation tools. The collaboration will research privacy-preserving approaches to AI evaluation — enabling measurement of AI systems even when underlying data, models, or benchmarks are confidential due to IP, data protection, or national security constraints. It will leverage OpenMined's PySyft infrastructure and is expected to produce voluntary standards, best practices, and recommendations for AI practitioners on effective measurement, including for workforce and productivity impact assessment.

Key points

NIST CAISI has signed a CRADA with OpenMined to research privacy-preserving methods for AI evaluations.
The collaboration aims to enable rigorous AI measurement when data, models, or benchmarks must remain confidential.
Outputs are expected to inform voluntary standards and best practices for AI evaluation, which is relevant as Australia's AISI considers evaluation frameworks.

Implications

Monitor: Australia's AISI and DISR policy teams may want to track outputs from this collaboration, as they could inform Australian approaches to AI evaluation and measurement standards.
Consider: Agencies developing AI procurement or assurance frameworks could explore how privacy-preserving evaluation techniques might address confidentiality barriers when assessing vendor AI systems.

Standards & Frameworks (1 item)

OECD AI Wonk Blog (Global) · 24 Mar 2026

To be truly participative, stakeholder involvement should follow an AI system’s entire lifecycle

An OECD AI Wonk Blog post argues that participatory AI governance is routinely reduced to one-off consultation, and that genuine participation requires governance infrastructure, community authority, and oversight mechanisms that persist across an AI system's entire lifecycle. The piece draws on OECD AI principles and positions lifecycle-wide stakeholder involvement as essential to trustworthy AI. The extracted text is limited, so the full analytical depth is not available here; practitioners interested in participatory governance models should read the source.

Key points

The post argues that participatory AI governance must extend beyond one-off consultation to cover an AI system's full lifecycle.
Governance infrastructure and community authority are identified as prerequisites for meaningful stakeholder involvement.
Extracted text is brief; full argument detail requires reading the source directly.

Implications

Consider: Agencies designing AI governance frameworks could review whether their stakeholder engagement processes extend beyond initial consultation to cover deployment, monitoring, and decommissioning phases.
Monitor: Policy teams working on the APS AI Plan or responsible AI guidance may want to track OECD output on participatory governance as an emerging international reference point.

Technical Developments (1 item)

Import AI – Substack (Jack Clark) (Multi) · 23 Mar 2026

Import AI 450: China's electronic warfare model; traumatized LLMs; and a scaling law for cyberattacks

This issue of Import AI covers four research items. Most significant for APS readers: the UK AI Security Institute has published results from cyber range testing showing frontier AI agents are improving rapidly at end-to-end multi-step cyberattacks, with performance nearly doubling across model generations and scaling further with inference compute. Separately, Chinese researchers including those affiliated with the National University of Defense Technology have released MERLIN, an AI model trained on a 100,000-sample electromagnetic signal dataset for electronic warfare tasks including jamming strategy and signal classification. Two other items - Google DeepMind's ten-dimension cognitive taxonomy for assessing AGI progress, and research diagnosing 'distress-like' response patterns in Google's Gemma models - are primarily of technical interest with limited immediate APS governance implications.

Key points

UK AISI finds that successive frontier model generations improve measurably at multi-step autonomous cyberattacks, and that performance scales further with inference compute.
Chinese military-affiliated researchers released MERLIN, an AI model and dataset targeting electronic warfare signal reasoning.
The newsletter also covers Google DeepMind's cognitive taxonomy for assessing AGI progress and research on 'distress-like' response patterns in Google's Gemma models, both of lower APS relevance.

Implications

Monitor: Cyber security and AI risk teams may want to track UK AISI's cyber range evaluation methodology and results as Australia's AISI considers analogous threat assessments.
Monitor: Defence and national security policy teams may want to track MERLIN and the broader Chinese military AI research trajectory for electronic warfare capability implications.

Implications are AI-generated. Starting points, not advice — see methodology for how they're framed.