The American Society of Anesthesiologists holds the anesthesiologist responsible for optimizing a patient’s preoperative medications, which includes enabling the patient to understand and adhere to preoperative instructions. Based on the National Assessment of Adult Literacy, nearly 40% of adults in the United States have basic or below basic health literacy. Accordingly, it is recommended that information given to patients be written at or below the 6th-grade level. Yet studies have shown that most written materials for patients are too difficult for the average patient’s reading skills. In perioperative medicine, the discrepancy between written instructions and patients’ health literacy has been identified as a contributor to nonadherence, which can result in delayed or canceled surgeries and adverse patient outcomes, worsening health care disparities. Recent advancements in large language models, such as the Generative Pretrained Transformer (GPT) chatbot systems developed by OpenAI (San Francisco, California), present an opportunity to tailor written information precisely to specified reader characteristics and facilitate patient comprehension. However, the utility and risks of this approach for preoperative patient communication are not yet known. In this work, we investigate the potential use of language models to enhance the readability of preoperative patient instructions while maintaining accuracy and comprehensiveness.

We compared standard English-language preoperative clinic patient instructions at a major academic medical center to versions enhanced by language models. Five synthetic preoperative clinic visits were created in the electronic health record (EHR) to generate baseline texts. After-visit patient instructions were generated using the hospital’s standard templates, each incorporating a different permutation of the available boilerplate language. These were then downloaded in plain text and manually scrubbed of any identifiable data, including staff names and hospital contact information. The instructions were then presented to GPT-3.5 and GPT-4 via the OpenAI inference endpoint with a prompt to improve readability to a target 6th-grade reading level. Requests used a temperature of 0.2, an independent top-p, and a unique seed, generating 25 replications per document for each model and a total of 250 GPT-generated documents. Experiments were developed in Python (v3.11), with readability scoring performed in Textstat (v0.7.3) and statistical analysis in SciPy (v1.12). The original documents, analysis, and results are available in a GitHub repository (https://github.com/stanfordaimlab/anesthesia-literacy; accessed May 3, 2024).
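For illustration, a minimal Python sketch of the generation step described above is shown below. It assumes the current OpenAI Python client; the prompt wording, model identifiers, and document loading are illustrative assumptions rather than the exact code used in this study.

```python
# Illustrative sketch only; prompt wording, model names, and data loading are
# assumptions, not the study's exact implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Rewrite the following preoperative patient instructions so that they read "
    "at or below a 6th-grade level. Do not add, remove, or change any medical content."
)

def simplify(document_text: str, model: str, seed: int) -> str:
    """Ask one model to rewrite one de-identified instruction document."""
    response = client.chat.completions.create(
        model=model,          # e.g., "gpt-3.5-turbo" or "gpt-4"
        temperature=0.2,      # low temperature for consistent rewrites
        seed=seed,            # a unique seed per request
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": document_text},
        ],
    )
    return response.choices[0].message.content

baseline_documents = [...]  # the five de-identified instruction texts, loaded as strings

# 5 documents x 2 models x 25 replications = 250 generated documents
generated = []
seed = 0
for doc in baseline_documents:
    for model in ("gpt-3.5-turbo", "gpt-4"):
        for _ in range(25):
            generated.append(simplify(doc, model, seed))
            seed += 1
```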

The primary outcome measure for the readability analysis was the U.S. school grade level required for adequate comprehension. This was estimated using the Flesch-Kincaid Grade Level scoring system, a well-validated readability measure that combines average sentence length (words per sentence) and average word length (syllables per word) into a weighted sum. Average readability scores across patient scenarios were compared among the baseline texts and the GPT-generated texts using ANOVA, with Tukey’s honest significant difference test for pairwise comparisons. To assess the accuracy and completeness of the modified text, specific, unambiguous instructions were selected from five domains (Supplemental Digital Content 2, https://links.lww.com/ALN/D611). Each document was evaluated by one of two anesthesiologists (HH, AG) to determine whether these instructions were fully included, partially included, or completely absent, scored as 1, 0.5, or 0, respectively. In this manner, a total of 1,250 individual instructions were evaluated.
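For reference, the Flesch-Kincaid Grade Level is computed as 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) - 15.59 and is implemented by Textstat’s flesch_kincaid_grade function. A minimal Python sketch of the scoring and group comparison follows; the variable names and the loading of the grouped texts are illustrative assumptions rather than the study’s exact analysis code.

```python
# Illustrative analysis sketch; variable names and data loading are assumptions.
import textstat
from scipy import stats

# Assume the texts have been loaded and grouped by their source.
original_texts: list[str] = [...]   # 5 baseline template documents
gpt35_texts: list[str] = [...]      # 125 GPT-3.5 rewrites (5 documents x 25 replications)
gpt4_texts: list[str] = [...]       # 125 GPT-4 rewrites

def grade_levels(texts: list[str]) -> list[float]:
    """Flesch-Kincaid Grade Level for each document."""
    return [textstat.flesch_kincaid_grade(t) for t in texts]

original_scores = grade_levels(original_texts)
gpt35_scores = grade_levels(gpt35_texts)
gpt4_scores = grade_levels(gpt4_texts)

# One-way ANOVA across the three groups, then Tukey's honest significant
# difference test for the pairwise comparisons.
anova = stats.f_oneway(original_scores, gpt35_scores, gpt4_scores)
tukey = stats.tukey_hsd(original_scores, gpt35_scores, gpt4_scores)
print(f"ANOVA p-value: {anova.pvalue:.4f}")
print(tukey)  # pairwise mean differences with confidence intervals
```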

This study included 255 patient instruction scripts for readability analysis, generated for five patient scenarios using a standard hospital template, GPT-3.5, and GPT-4. As visualized in figure 1, the readability grade level of the patient instructions generated by GPT-4 was the lowest, with a mean grade level of 5.0 (±0.76). Outputs from GPT-3.5 had a mean readability at the 10th-grade level (±0.44), a minimal improvement over the baseline texts, whose mean readability score was 10 (±0.37). Table 1 displays pairwise comparisons of these scores. The patient instructions generated by GPT-4 consistently had readability scores under the 6th-grade level and were significantly less complex across all patient scenarios relative to both the standard hospital text (P < 0.01) and GPT-3.5 (P < 0.01). Evaluation of accuracy and completeness demonstrated no inaccurate, missing, or partially complete components.

Table 1.

Pairwise Comparisons of Flesch-Kincaid Grade Level Means of Original, GPT-3.5, and GPT-4 Versions Using Tukey’s Honest Significant Difference Test

Fig 1.
Box-and-Whisker Diagram of Flesch-Kincaid Grade Level Results in Original, GPT-3.5, and GPT-4 Versions. The horizontal line within each green box denotes the mean readability score in the respective group; boxes extend from the 25th to the 75th percentile of each group’s distribution of values. The vertical lines above and below each box denote adjacent values, i.e., the most extreme values within 1.5 times the interquartile range of the 25th and 75th percentiles of each group. Hollow dots denote observations outside the range of adjacent values.


This work suggests that language models can be used to improve the readability of preoperative patient instructions to a 6th-grade reading level or below, achieving the recommended level of complexity for patient-facing written material without compromising accuracy or comprehensiveness. The standard preoperative patient instruction templates required a 9th-grade reading level or above, which is too complex for many American patients. Interestingly, GPT-4 performed noticeably better than GPT-3.5 at improving readability using the same prompts and texts. GPT-4 consistently generated instructions meeting the target reading level, whereas GPT-3.5 failed to improve readability by more than one whole grade level from the baseline text in any of the patient scenarios. Such a marked difference in performance highlights the importance of the specific large language model employed.

This study is not without limitations. First, the standard patient instructions used as baseline texts were from a single institution. While patient instruction templates vary across hospitals, the opportunity to use language models to improve the readability of written material is relevant to all institutions with English-speaking patients; subsequent work should apply a similar methodology to other languages. Second, our evaluation of accuracy was limited to manual review of five specific components in each of the 250 generated documents. While no errors were detected in these domains, we cannot entirely rule out errors in other sections. Lastly, this study was limited to OpenAI’s language models. We selected the most recent versions of GPT based on their popularity and ease of access. It is important to note that neither GPT system in its publicly available form complies with the Health Insurance Portability and Accountability Act (HIPAA). Future studies should test the utility of other language models, ideally those with open-source architecture that can be incorporated into locally run clinical informatics tools and clinicians’ workflows without the need to de-identify protected health information. EHR integration and HIPAA-compliant data security will be critical in applying large language model capability to enhance physician-patient communication toward personalized perioperative medicine.