An essential skill of an experienced anesthesiologist is to quickly determine when a vital sign alarm is valid. When the oscillometric blood pressure cuff fails to detect pulsations during deflation, is that due to the surgeon leaning on the cuff or a sudden drop in cardiac output? If the automated arrhythmia detection alarm sounds, is the patient’s heart fibrillating or is the junior surgical resident applying antiseptic a little too roughly? The anesthesiologist in the room has the benefit of context in these situations, but that context is left behind when we preserve vital sign data for later analysis. Vital sign data artifacts not only are a problem for retrospective observational research but also pose a threat to the accuracy of automated quality measures and pragmatic prospective clinical trial results.
In this issue of Anesthesiology, Maleczek et al. at the Medical University of Vienna (Vienna, Austria) compare old and new methods for vital sign artifact detection using a data set labeled retrospectively for likely artifacts by human experts. The authors found that each vital sign had a different performance profile for artifact detection. This finding suggests that optimal artifact detection requires a different method for each type of vital sign.
The data set included 106 patients evenly split between the operating room and the intensive care unit (ICU). Each patient’s recordings included five vital signs stored as discrete numerical values: electrocardiogram heart rate, blood pressure, temperature, capnometry, and oxygen saturation. Operating room vital signs were recorded every 15 s, while ICU vital signs were recorded every 15 min. Both invasive and noninvasive blood pressure data were available. The evaluation compared five methods: a set of cutoff values, z-value filtering, interquartile range filtering, local outlier factor, and a popular machine learning technique known as a long short-term memory neural network.
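For readers unfamiliar with the simpler techniques, a minimal sketch of how the cutoff, z-value, and interquartile range filters might look in code follows. The thresholds are illustrative assumptions, not the values used by Maleczek et al., and the local outlier factor and long short-term memory approaches are omitted because they require additional machinery (for example, a nearest-neighbor or neural network library).

# Illustrative Python sketch of cutoff, z-value, and interquartile range
# artifact filters. All thresholds below are assumptions for demonstration.
import numpy as np

def cutoff_filter(values, low, high):
    """Flag readings outside a fixed physiologic plausibility range."""
    values = np.asarray(values, dtype=float)
    return (values < low) | (values > high)

def z_value_filter(values, z_max=2.0):
    """Flag readings more than z_max standard deviations from the mean.
    A single extreme artifact inflates the standard deviation, which can
    mask the artifact at larger thresholds; 2.0 is used here for illustration."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return np.abs(z) > z_max

def iqr_filter(values, k=1.5):
    """Flag readings more than k interquartile ranges beyond the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical invasive mean arterial pressure samples (mmHg) containing one
# obvious artifact, such as a flush of the arterial line.
map_mmhg = [72, 75, 71, 78, 74, 250, 73, 76]
print(cutoff_filter(map_mmhg, low=30, high=200))
print(z_value_filter(map_mmhg))
print(iqr_filter(map_mmhg))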
The results showed that a different method performed best for each major vital sign. In the operating room, the machine learning method performed best for temperature (sensitivity, 76.1%) and heart rate (sensitivity, 39.5%). Capnography artifacts were best identified with the interquartile range method (sensitivity, 72.5%). Invasive mean arterial pressure artifact identification with a simple cutoff approach had the highest sensitivity (74.9%). Oxygen saturation artifacts were best detected with the z-value approach (sensitivity, 88.9%).
However, in the ICU the results were quite different. The machine learning method was best for capnography (specificity, 72.6%), heart rate (sensitivity, 33.6%), and invasive mean arterial pressure (sensitivity, 51.5%). The interquartile range method worked best for temperature (sensitivity, 71.9%). The z-value approach worked best for oxygen saturation (sensitivity, 73.9%). The cutoff approach was best for none of the ICU vital signs.
The study had important limitations, which the authors acknowledge. Artifacts were labeled not by real-time observers but afterward, by experts who judged from experience which data points appeared artifactual. For this reason, the artifacts in the study are better described as “suspected artifacts.” Artifacts that produce physiologically implausible values are easier to flag. A good example is an invasive systolic blood pressure in the 300s coincident with a blood sample being drawn from the arterial line. However, the transducer of the same arterial line may sit at the wrong height for several minutes, producing readings that are incorrect but plausible.
Maleczek et al. are to be commended for comparing machine learning with traditional techniques. In recent research, artificial intelligence methods are often the only ones tried, even when simpler approaches would suffice. Another notable feature of this study is that it included as many ICU encounters as operating room cases, which matters because the sources of artifact in the ICU differ in nature from those in the operating room.
Vital sign artifact filtering methods can affect research results and interpretation regardless of the overall study design. Techniques for handling mean arterial pressure artifacts were reviewed in an article by Pasma et al. They observed that estimates of hypotension prevalence were significantly affected by the artifact handling technique when a mean arterial pressure threshold of 50 mmHg was used, whereas higher thresholds, and the association between intraoperative hypotension and postoperative myocardial injury, were affected to a much smaller degree. However, that study did not have a data set labeled with suspected artifacts, so actual artifact detection performance could not be measured.
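To make the point concrete, the following is a hypothetical illustration, with invented values rather than data from Pasma et al., of how the estimated prevalence of hypotension below a 50 mmHg mean arterial pressure threshold shifts with the artifact handling strategy.

# Hypothetical mean arterial pressure readings (mmHg); 12 and 8 are meant to
# represent likely artifacts from a dropped transducer or a line flush.
import numpy as np

map_mmhg = np.array([68, 47, 72, 12, 65, 70, 8, 66])
threshold = 50.0

def hypotension_prevalence(values):
    """Fraction of readings below the hypotension threshold."""
    return float(np.mean(values < threshold))

# Strategy 1: no artifact handling -- artifacts are counted as hypotension.
unfiltered = hypotension_prevalence(map_mmhg)

# Strategy 2: discard physiologically implausible readings before analysis.
filtered = hypotension_prevalence(map_mmhg[map_mmhg > 30])

print(f"unfiltered: {unfiltered:.2f}, filtered: {filtered:.2f}")  # 0.38 vs 0.17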
There was a time when some anesthesiologists were critical of the idea that unverified vital signs would automatically be filed to the medical record. Many anesthesia information systems required that each vital sign first be verified by a clinician. Opinion slowly changed to accept that a warts-and-all anesthesia record was just as defensible as one with “railroad tracks.” In fact, the absence of vital sign data can be a more significant medicolegal issue than artifactual data.5
It is an exciting time in perioperative research because new high-quality data sets are available to investigators. Two recent arrivals are the Medical Informatics Operating Room Vitals and Events Repository, a set of surgical electronic health record and waveform data from the University of California, Irvine, and the Medical Information Mart for Intensive Care (MIMIC)-IV data set. The Multicenter Perioperative Outcomes Group (MPOG) has a high-quality data set drawn from more than 70 hospitals. MPOG details its artifact handling for blood pressure, ventilation parameters, and other vital signs in a publicly available phenotype library.* It is important that peer-reviewed journals continue to require detailed methods that explain exactly how artifacts were identified and filtered.
Because of the trust clinicians and researchers place in automated vital sign capture, data artifacts can affect prospective clinical trials just as much as retrospective studies. Pragmatic clinical trial design encourages researchers to rely on “usual care” data collection as much as possible to reduce the burden on trial personnel and improve the generalizability of results.
The rise of electronic clinical quality measures means that vital sign data will increasingly be used to judge anesthesiologist performance objectively. The ePreop31 measure, available from the Anesthesia Quality Institute (Schaumburg, Illinois) and licensed from the Provation Software Group (Minneapolis, Minnesota), determines whether excess intraoperative hypotension occurred by examining blood pressure data from the anesthesia record. The measure exempts blood pressure values that fall outside preset cutoffs or are marked as artifact by the clinician. Measures proposed for adoption should be vetted to ensure that vital sign artifact does not lead to inaccurate quality results. It is difficult to predict how vital sign artifacts will change quality measure outcomes. For this measure, an artifact filter that removes true hypotension could make performance appear better than it really is. For another measure, a poorly designed filter could make the anesthesiology group appear to have worse quality than it actually does.
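A simplified sketch of this kind of exemption logic follows. The cutoffs, the hypotension threshold, and the “excess hypotension” rule here are invented for illustration and are not the actual ePreop31 specification.

# Toy version of a hypotension quality measure that exempts readings outside
# preset cutoffs or marked as artifact by the clinician. All numbers are
# illustrative assumptions, not the real measure definition.
from dataclasses import dataclass

@dataclass
class Reading:
    minute: int
    map_mmhg: float
    marked_artifact: bool = False

def excess_hypotension(readings, low_cutoff=30.0, high_cutoff=200.0,
                       hypo_threshold=65.0, max_hypo_readings=15):
    # Exempt clinician-flagged artifacts and values outside plausibility cutoffs.
    valid = [r for r in readings
             if not r.marked_artifact and low_cutoff <= r.map_mmhg <= high_cutoff]
    hypotensive = sum(1 for r in valid if r.map_mmhg < hypo_threshold)
    return hypotensive > max_hypo_readings

case = [Reading(0, 72), Reading(1, 15), Reading(2, 62, marked_artifact=True),
        Reading(3, 68)]
print(excess_hypotension(case))  # False: the 15 mmHg and flagged readings are exempted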
In the future, one hopes that vital sign equipment will grow less prone to reading and recording artifact as fact. With additional input such as audio and video recording of the anesthesiologist’s perspective, a sufficiently advanced artificial intelligence model may be able to determine both the presence of and the reason for an artifact. Until that time, we need to be aware that retrospective research, prospective research, and quality measure results are all subject to data artifacts in vital sign collection. Research using an accepted reference standard is needed to determine how artifactual data change quality measure results, for better or worse.