Evaluating Large Language Model Performance in Generating Clinically Relevant Intensive Care Unit Discharge Summaries

Authors: Mudumbai SC et al.

A & A Practice. 19(9):e02057, September 2025.

Summary
This study evaluated the clinical quality of large language model (LLM)–generated intensive care unit (ICU) discharge summaries compared with traditional physician-authored summaries. ICU discharge summaries are time-intensive and cognitively demanding, requiring accurate synthesis of complex, longitudinal clinical data. The authors examined whether current LLMs can meaningfully support or automate this task.

Ten ICU patient cases were randomly selected from the MIMIC-III database, each containing exactly 20 physician notes (including admission, progress, and consultation notes). A Bidirectional and Auto-Regressive Transformer (BART) model was used to generate individual note summaries, which were then merged into a single ICU discharge summary per patient. Four experienced intensivists independently evaluated the LLM-generated summaries using a 5-point Likert scale across six domains: coherence, consistency, fluency, relevance, utility, and overall quality relative to human-authored summaries.
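The article does not include the authors' code, but the per-note summarize-then-merge design can be illustrated with the Hugging Face transformers library. Below is a minimal sketch, assuming the publicly available facebook/bart-large-cnn checkpoint and a hypothetical list of de-identified note texts; the study's actual checkpoint, preprocessing, and generation settings are not reported here.

```python
# Minimal sketch of a per-note summarize-then-merge pipeline. Assumes the
# public facebook/bart-large-cnn checkpoint; the study's actual checkpoint,
# preprocessing, and generation settings are not stated in the article.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_icu_stay(notes: list[str]) -> str:
    """Summarize each physician note, then merge into one draft summary.

    `notes` is a hypothetical list of de-identified note texts (e.g., the
    20 admission, progress, and consultation notes per MIMIC-III case).
    """
    note_summaries = []
    for note in notes:
        # BART accepts at most 1024 input tokens, so long notes are
        # truncated here; a production pipeline would chunk them instead.
        result = summarizer(note, max_length=150, min_length=40,
                            do_sample=False, truncation=True)
        note_summaries.append(result[0]["summary_text"])
    # Naive merge: concatenate per-note summaries into one draft document.
    return "\n".join(note_summaries)
```

A simple concatenation step like the one above may help explain the reviewers' findings: each note is condensed in isolation, so cross-note prioritization of complications, medications, and care decisions is left to chance.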

LLM-generated summaries performed well in language-focused domains. Median scores were high for coherence and fluency, indicating that the summaries were readable, logically structured, and linguistically smooth. However, performance declined in clinically critical domains. Scores for consistency, relevance, and utility were moderate, reflecting frequent omissions of patient-specific details, incomplete clinical reasoning, and limited usefulness for downstream care. Overall quality compared with human-authored summaries scored lowest, suggesting that despite readability, LLM summaries lacked the depth, nuance, and prioritization expected in ICU discharge documentation.

Inter-rater reliability among intensivists was moderate for coherence, consistency, and fluency, but lower for relevance and utility, highlighting variability in clinician expectations regarding clinical usefulness. Reviewers consistently noted that LLM summaries tended to generalize, omit key events, and underrepresent complications, medications, and care decisions critical for continuity of care.
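This summary reports agreement only qualitatively and does not name the statistic the authors used. Purely as an illustration, a common choice for a fixed panel of raters scoring items on a Likert scale is Fleiss' kappa, sketched below with statsmodels on a hypothetical ratings matrix.

```python
# Illustrative only: inter-rater agreement for 4 raters scoring 10
# summaries on a 5-point Likert scale, using Fleiss' kappa. The article
# does not state which agreement statistic the authors actually used.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows = summaries, columns = the four intensivists,
# values = Likert scores (1-5) for a single domain (e.g., coherence).
ratings = np.array([
    [4, 4, 5, 4],
    [3, 4, 4, 3],
    [5, 4, 4, 5],
    [2, 3, 3, 2],
    [4, 4, 4, 3],
    [3, 3, 4, 4],
    [5, 5, 4, 4],
    [2, 2, 3, 3],
    [4, 3, 4, 4],
    [3, 4, 3, 3],
])

# Convert to a subjects-by-categories count table, then compute kappa.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
# Values of 0.41-0.60 are conventionally read as "moderate" agreement.
print(f"Fleiss' kappa: {kappa:.2f}")
```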

The authors conclude that while current LLMs can generate fluent ICU discharge summaries, they are not yet clinically equivalent to human-authored documentation. Significant improvements in ICU-specific fine-tuning, structured data integration, and domain-aware reasoning are required before LLMs can safely and reliably support ICU discharge workflows.

What You Should Know
LLMs generate readable and coherent ICU discharge summaries but often miss critical clinical details.
Clinical utility and relevance remain inferior to those of physician-authored summaries.
Language quality alone is insufficient for safe ICU discharge documentation.
ICU-specific training and integration of structured clinical data are essential for future improvement.
LLMs may serve as drafting aids rather than autonomous documentation tools at present.

Thank you for allowing us to review and summarize this article from A & A Practice.
