New study warns of risks in AI chatbots giving medical advice
Rigorous empirical evidence that LLM benchmarks fail to predict real-world medical safety - directly relevant to AI risk assessment in health and human services contexts.
Key points
- A randomised trial of 1,298 participants found LLMs no better than search engines for medical decision-making.
- Benchmark test performance does not reliably predict real-world safety - regulators and agencies should note this gap.
- Australian health agencies and AI governance teams considering LLM-assisted health tools have directly applicable evidence here.
Summary
A Nature Medicine study from the Oxford Internet Institute, involving nearly 1,300 participants, found that LLMs provided no meaningful improvement over traditional search engines for medical advice and introduced risks through inaccurate, inconsistent, and hard-to-evaluate outputs. Users struggled to know what information to provide, and models gave highly variable answers to slight question variations. Critically, the study demonstrates that standard benchmark evaluations fail to capture real-world performance - a finding with broad implications for how governments and regulators assess AI system safety before deployment in high-stakes domains.
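One way an evaluation team might quantify the answer variability described above is to ask the same scenario in several rewordings and measure how often the model's recommendation matches its most common one. The sketch below is a minimal illustration, not the study's method: the function name `modal_agreement` and the triage labels are hypothetical, and it assumes each response has already been reduced to a single category.

```python
from collections import Counter


def modal_agreement(recommendations):
    """Return the share of responses that match the most common recommendation."""
    if not recommendations:
        raise ValueError("need at least one recommendation")
    counts = Counter(recommendations)
    # most_common(1) returns a one-item list of (label, count) pairs
    _, top_count = counts.most_common(1)[0]
    return top_count / len(recommendations)


# Hypothetical labels from five rewordings of one scenario (illustrative only)
labels = ["emergency", "emergency", "self-care", "emergency", "see GP"]
print(modal_agreement(labels))  # 0.6
```

A score of 1.0 means every rewording produced the same recommendation; lower scores indicate that slight changes in wording changed the advice. The measure says nothing about whether the most common answer is correct, so it would complement accuracy checks rather than replace them.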
Implications for Australian agencies
- [Consider] Agencies overseeing or procuring AI tools for health or human services contexts could check whether their evaluation frameworks account for real-user variability, not just benchmark performance.
- [Monitor] AI governance teams may want to watch whether this study influences TGA, AISI, or health department guidance on LLM deployment in clinical or consumer health settings.
Implications are AI-generated. Starting points, not advice.
"New study warns of risks in AI chatbots giving medical advice"
Source: Oxford Internet Institute – News
Published: 9 February 2026
URL: https://www.oii.ox.ac.uk/news-events/new-study-warns-of-risks-in-ai-chatbots-giving-medical-advice/
Retrieved from SIMS, 18 May 2026.