Can AI generate safe anaesthesia plans? A comparative evaluation of three large language models on 100 synthetic cases

Authors: Audrey Jarrassier et al.

Anaesthesia Critical Care & Pain Medicine, Volume 45, Issue 4, July 2026

Highlights

  • Large language models tested on 100 synthetic preoperative cases.
  • ChatGPT showed highest completeness and guideline adherence.
  • Mistral produced significantly more unsafe recommendations.
  • All models showed increased errors with higher American Society of Anesthesiologists (ASA) physical status complexity.
  • LLMs useful for routine cases but unreliable for high-risk patients.

Background

Preoperative anaesthetic consultation is essential to perioperative care, encompassing risk assessment, treatment optimisation, and planning of anaesthetic strategies according to established guidelines. Large language models (LLMs) could offer decision support in this setting, but their ability to autonomously generate comprehensive, guideline-based anaesthetic plans has not been assessed in France and remains uncertain worldwide.

Methods

In this simulation study, 100 synthetic clinical cases spanning various surgical specialties were evaluated. Three LLMs—ChatGPT, Mistral Le Chat, and a domain-specific model (Dougall GPT)—were prompted to generate preoperative anaesthetic assessments and plans based on French guidelines. Outputs were compared with expert anaesthesiologist reference plans using structured expert scoring across multiple domains, including guideline adherence and clinical safety.

Results

Among the 1200 data fields evaluated across 100 cases (12 anaesthesia domains per case, each scored on a 0–4 expert-derived ordinal scale, where 4 = guideline-concordant), ChatGPT showed the highest overall guideline conformity. It provided the most complete outputs (98% of requested items) and achieved the highest median agreement scores in seven of the 12 domains. Dougall GPT performed moderately, whereas Mistral Le Chat showed lower conformity and the highest proportion of unsafe or potentially unsafe outputs (scores ≤2).

Conclusions

Current LLMs demonstrate encouraging potential to support preoperative anaesthetic planning for routine cases. However, their reliability remains insufficient for high-risk or complex patients without further fine-tuning and safety controls. These findings underscore both the potential and the current limitations of AI in perioperative decision support.
