Anesthesiology September 2024, Vol. 141, A13–A15.
Large language model capabilities in perioperative risk prediction and prognostication. JAMA Surg 2024; e241621
Large language models (LLMs) like ChatGPT are widely used, but their value in perioperative care remains uncertain. This retrospective prognostic study evaluated six “prompting strategies” for question-and-answer tasks using GPT-4 Turbo (OpenAI) on 133,500 anesthesia preoperative notes and 226,821 clinical notes from three U.S. academic medical centers. The LLM was asked to predict eight outcomes: in-hospital mortality, American Society of Anesthesiologists (ASA) Physical Status, hospital admission, intensive care unit (ICU) admission, unplanned admission, postanesthesia care unit (PACU) phase 1 duration, hospital length of stay, and ICU length of stay. Predictions were validated against labeled data for the outcomes of interest. Performance was described using F1 scores, which combine true positive, false positive, and false negative counts into a single metric (ranging from 0 = lowest to 1 = highest performance). Prompting strategies that supplied the LLM with example notes (“few-shot”) and those requiring the model to explain its reasoning (“chain of thought”) generally performed best. Across outcomes, LLMs performed best at predicting in-hospital mortality (F1 score, 0.86; 95% CI, 0.83 to 0.89) and need for ICU admission (F1 score, 0.81; 95% CI, 0.78 to 0.83), with lower performance for need for hospital admission (F1 score, 0.64; 95% CI, 0.61 to 0.67), unplanned admission (F1 score, 0.61; 95% CI, 0.58 to 0.64), and ASA Physical Status (F1 score, 0.50; 95% CI, 0.47 to 0.53). Prediction of duration outcomes such as hospital and ICU length of stay and PACU phase 1 duration was poor.
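To make the F1 metric concrete, it can be sketched as below: the harmonic mean of precision and recall, computed from true positive, false positive, and false negative counts. The counts in the example are hypothetical and chosen only to illustrate how an F1 of roughly 0.86 could arise; they are not taken from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall.

    tp: true positives, fp: false positives, fn: false negatives.
    True negatives do not enter the calculation.
    """
    precision = tp / (tp + fp)  # of positive predictions, fraction correct
    recall = tp / (tp + fn)     # of actual positives, fraction found
    return 2 * precision * recall / (precision + recall)

# Hypothetical illustration (not study data): 86 correct positive calls,
# 10 false alarms, 18 missed cases.
print(round(f1_score(86, 10, 18), 2))  # 0.86
```

Note that F1 ignores true negatives, which is why it is preferred over raw accuracy for imbalanced outcomes such as in-hospital mortality, where most patients survive.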
Take home message: This retrospective analysis suggests that commonly used LLMs (like ChatGPT) may be able to aid in perioperative risk stratification for specific outcomes using customized prompting strategies.