CAISI Evaluation of DeepSeek V4 Pro

NIST – AI News (topic 2753736)(US) 1 May 2026 62

Independent government evaluation reveals a gap between vendor-reported and independently verified AI capability - directly relevant to how APS agencies assess AI procurement claims.

Key points

CAISI's April 2026 independent evaluation found DeepSeek V4 Pro lags US frontier models by approximately 8 months.
DeepSeek's self-reported benchmarks overstate its capability relative to CAISI's non-public, held-out evaluations.
DeepSeek V4 is more cost-efficient than comparable US models on most benchmarks - a procurement-relevant finding.

Implications for Australian agencies

Consider: APS agencies evaluating AI models for procurement or pilots could apply independent or held-out benchmarks rather than relying on vendor self-reported capability claims.
Monitor: Policy and security teams may want to track CAISI's ongoing evaluations for signals on PRC model capabilities, particularly in the cyber and software engineering domains relevant to government use.

Implications are AI-generated. Starting points, not advice — see methodology for how they're framed.

Appeared in: Weekly digest, 27 April 2026