Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4
Automated AI safety research and divergent Chinese model safety behaviours both have direct implications for how Australian agencies assess and govern frontier AI systems.
Key points
- Anthropic researchers show that Claude-based AI agents outperform human researchers on a weak-to-strong supervision alignment task, recovering 97% of the performance gap.
- A safety study of the Chinese model Kimi K2.5 finds fewer refusals on CBRN-related prompts and stronger ideological conditioning than in US models.
- Huawei's HiFloat4 training format outperforms the Western-standard MXFP4 on Ascend chips; its development was partly driven by US export controls on frontier chips.
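The 97% figure is a "performance gap recovered" (PGR) score. The brief does not define it, so the following is a minimal sketch assuming the definition commonly used in weak-to-strong generalisation work: the share of the gap between a weak supervisor's score and a strong model's ground-truth ceiling that is closed by training the strong model on the weak supervisor's labels. The function name and the example scores are hypothetical, not taken from the Anthropic study.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-strong performance gap that is closed (assumed standard PGR definition).

    weak: score of the weak supervisor model on its own
    weak_to_strong: score of the strong model trained on the weak model's labels
    strong_ceiling: score of the strong model trained on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap == 0:
        # With no gap to recover, the ratio is undefined.
        raise ValueError("strong ceiling equals weak score; PGR is undefined")
    return (weak_to_strong - weak) / gap


# Hypothetical scores: weak supervisor 0.50, strong ceiling 0.75, weak-to-strong 0.7425.
print(performance_gap_recovered(0.50, 0.7425, 0.75))
```

With these made-up scores the result is about 0.97: a PGR of 0 means the strong model did no better than its weak supervisor, and 1 means it matched the ground-truth ceiling.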
Summary
This edition of Import AI covers three significant technical developments. Anthropic demonstrates that Claude-based automated alignment researchers (AARs) can autonomously conduct AI safety R&D, dramatically outperforming human researchers on a weak-to-strong supervision task at roughly $22 per hour of research time, though the results did not generalise to production models. A separate safety study of the Chinese frontier model Kimi K2.5 finds lower refusal rates on CBRN-related prompts and stronger ideological conditioning relative to US models. Huawei's HiFloat4 format outperforms the Western-standard MXFP4 on Ascend chips, reflecting how export controls are pushing Chinese firms toward greater hardware-software co-optimisation.
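The summary compares HiFloat4 with MXFP4 without describing either format. HiFloat4's encoding is not given here, so this sketch covers only MXFP4, assuming the OCP Microscaling (MX) layout: each block of 32 values shares one power-of-two scale, and each element is stored as a 4-bit E2M1 float. It is a simplified illustration, not any vendor's implementation. It rounds ties toward the smaller magnitude rather than to nearest-even, and it ignores the exponent range of the 8-bit shared scale. All function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (per the OCP MX specification).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_EMAX = 2      # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32   # values sharing one scale in MXFP4


def nearest_fp4(x: float) -> float:
    """Round x to the nearest E2M1 value, clamping to +/-6.

    Simplification: ties go to the smaller magnitude, not round-to-nearest-even.
    """
    mag = min(abs(x), FP4_E2M1_VALUES[-1])
    best = min(FP4_E2M1_VALUES, key=lambda v: abs(v - mag))
    return math.copysign(best, x)


def quantize_mxfp4_block(block: list[float]) -> tuple[float, list[float]]:
    """Return (shared power-of-two scale, E2M1 elements) for one block."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Align the block's largest value with the largest E2M1 exponent.
    shared_exp = math.floor(math.log2(max_abs)) - FP4_EMAX
    scale = 2.0 ** shared_exp
    return scale, [nearest_fp4(v / scale) for v in block]


def quantize_mxfp4(values: list[float]) -> list[tuple[float, list[float]]]:
    """Quantise a flat list block by block."""
    return [quantize_mxfp4_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


def dequantize(scale: float, elements: list[float]) -> list[float]:
    """Reconstruct approximate values: scale * element."""
    return [scale * e for e in elements]


scale, elems = quantize_mxfp4_block([0.7, 12.0])
print(scale, elems, dequantize(scale, elems))
```

In this toy block, 0.7 comes back as 1.0 because only eight magnitudes are available per element once the shared scale is fixed by the largest value. Precision loss of this kind is what competing 4-bit formats try to reduce.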
Implications for Australian agencies
- [Monitor] Agencies tracking frontier AI safety could monitor Anthropic's automated alignment research program, as it may accelerate or reshape the landscape of AI safety evidence available to regulators.
- [Consider] Policy and risk teams procuring or evaluating Chinese-origin AI models may want to consider safety divergence findings, such as reduced CBRN refusals, as part of due diligence and risk assessment processes.
Implications are AI-generated. Starting points, not advice.
Source: "Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4", Import AI – Substack (Jack Clark)
Published: 20 April 2026
URL: https://importai.substack.com/p/import-ai-454-automating-alignment
Retrieved from SIMS, 18 May 2026.