New study warns of risks in AI chatbots giving medical advice

Oxford Internet Institute – News (UK) 9 Feb 2026 62

Benchmark scores do not predict real-world safety in high-stakes settings — a direct challenge to how agencies evaluate AI tools before deployment.

Key points

A randomised trial of 1,298 participants found LLMs performed no better than search engines for medical decision-making.
LLM benchmark scores failed to predict real-world performance, raising questions about reliance on standardised evaluation methods.
The research is UK-based and has no immediate Australian regulatory parallel, but its findings are relevant to health AI risk assessment globally.

Consider: Agencies developing or procuring AI tools for citizen-facing or high-stakes internal use could assess whether current evaluation methods adequately capture real-user interaction risks, not just benchmark performance.
Monitor: Health and human services agencies may want to track emerging evidence on LLM reliability in sensitive domains as AI health tools become more prevalent in Australian contexts.

Implications are AI-generated. Starting points, not advice — see methodology for how they're framed.