CAISI Evaluation of DeepSeek V4 Pro
Independent US government evaluation of a leading PRC AI model challenges vendor self-reporting and signals the value of third-party capability assessment — a practice Australian agencies may wish to reference.
Key points
- CAISI's independent evaluation finds DeepSeek V4 Pro lags US frontier AI models by approximately 8 months.
- DeepSeek's self-reported benchmarks overstate its capability relative to CAISI's non-public, held-out evaluations.
- DeepSeek V4 is more cost-efficient than the comparable US reference model (GPT-5.4 mini) on five of seven benchmarks, raising procurement considerations.
Summary
NIST's Center for AI Standards and Innovation (CAISI) conducted an independent evaluation of DeepSeek V4 Pro in April 2026, finding it to be the most capable PRC AI model assessed to date but trailing the US frontier by approximately 8 months when measured against non-public, held-out benchmarks. Notably, DeepSeek's own self-reported evaluations present a more favourable picture, suggesting rough parity with frontier US models — a discrepancy CAISI attributes to benchmark selection. DeepSeek V4 was more cost-efficient than the comparable US reference model (GPT-5.4 mini) on five of seven benchmarks. The evaluation demonstrates an emerging US government practice of independent, rigorous model assessment using proprietary benchmarks to resist contamination and gaming.
Implications for Australian agencies
- [Monitor] Australian AISI and DISR policy teams may want to monitor CAISI's evolving evaluation methodology, particularly its use of held-out benchmarks, as a potential model for Australian capability assessment practices.
- [Consider] Agencies assessing AI procurement options involving PRC-origin models could consider how independent third-party evaluations differ from vendor self-reported benchmarks when forming risk assessments.
Implications are AI-generated. Starting points, not advice.
"CAISI Evaluation of DeepSeek V4 Pro" Source: NIST – AI News (topic 2753736) Published: 1 May 2026 URL: https://www.nist.gov/news-events/news/2026/05/caisi-evaluation-deepseek-v4-pro Retrieved from SIMS, 18 May 2026.