The external validation of the model shows a compelling area under the receiver operating characteristics curve (AUC) of 0.95 for predicting hypotension 5 min before the event. However, the corresponding receiver operating characteristics curve presented in the index publication’s figure 3 (reprinted and adapted here as fig. 1) looks highly improbable. This receiver operating characteristics curve presents the sensitivity (true predictions of hypotension out of all hypotensive events) and corresponding specificity (true predictions of nonhypotension out of all nonhypotensive events) for all possible thresholds (“alarm limits”) of Hypotension Prediction Index. The receiver operating characteristics curve in the index publication shows that a specific (yet unspecified in the publication) threshold is associated with a specificity of approximately 100% and a sensitivity greater than 60%. When the specificity is 100%, there are no false predictions of hypotension. Therefore, the positive predictive value (true predictions of hypotension out of all predictions of hypotension) is also 100%. This means that Hypotension Prediction Index values above this threshold are always associated with future hypotension. We find it difficult to imagine that it is possible to predict with approximately 100% certainty that hypotension will occur while maintaining a reasonable sensitivity. Herein, based on the data selection described in the index publication and a computer simulation illustrating the consequence of that data selection, we present the hypothesis that the index publication and several reported validation studies of the Hypotension Prediction Index may contain a systematic statistical bias influencing its predictive abilities.
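The link between 100% specificity and 100% positive predictive value follows directly from the confusion-matrix definitions. The short sketch below verifies it with illustrative counts (the numbers are ours, not taken from the index publication):

```python
def specificity(tn, fp):
    """True predictions of nonhypotension out of all nonhypotensive events."""
    return tn / (tn + fp)

def ppv(tp, fp):
    """Positive predictive value: true alarms out of all alarms."""
    return tp / (tp + fp)

# Hypothetical counts at the threshold in question:
# 60% sensitivity (60 of 100 events detected) with zero false positives.
tp, fn = 60, 40
tn, fp = 200, 0

print(specificity(tn, fp))  # 1.0: 100% specificity means no false positives
print(ppv(tp, fp))          # 1.0: with fp = 0, every alarm is a true alarm
```

With any nonzero number of false positives, the two quantities decouple; only at exactly zero false positives are they both forced to 100%.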
Data Selection May Be Biased
A probable explanation for the reported high specificity can be found in the methods section of the index publication:
“Model Feature Selection and Training:
A hypotensive event was calculated by identifying a section of at least 1-min duration such that all data points in the section showed MAP < 65 mmHg. An event, or positive data point, was chosen as the sample recorded 5, 10, or 15 min before the hypotensive event. A nonhypotensive event was calculated by identifying a 30-min continuous section of data points such that the section was at least 20 min apart from any hypotensive event, and all data points in that section showed MAP > 75 mmHg. A nonevent, or negative data point, was the center point of the nonhypotensive event.”
Thus, a nonhypotensive event is defined as a 30-min section where MAP is above 75 mmHg. The sample used to predict a nonhypotensive event is the center point of this section. Therefore, a sample corresponding to a nonhypotensive event will always have a MAP greater than 75 mmHg, while samples corresponding to hypotensive events can have any MAP (this selection of events and samples is illustrated in the model-development paper’s supplementary fig. 2, reproduced here as fig. 2). Because of this selection, a sample with MAP less than 75 mmHg will always correspond to a future hypotensive event in the training and the test sets. This can explain why it was possible for the Hypotension Prediction Index to predict hypotension with 100% specificity: the Hypotension Prediction Index could achieve 100% specificity by simply reflecting the current MAP value, without the 22 additional features. Given that the algorithm is proprietary, we do not know which features are included in the Hypotension Prediction Index model, but we do know that MAP was one of the candidates. If the model training was effective, Hypotension Prediction Index should learn that a current MAP less than 75 mmHg is always associated with future hypotension.
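The selection rule quoted above can be made concrete in code. The sketch below is our reading of the index publication's rule, assuming one MAP sample per minute; the function name and implementation details are ours, and the exact handling of section boundaries in the original work may differ. The essential asymmetry is visible in the result: positive samples may have any MAP value, whereas negative samples are, by construction, drawn from the middle of sections where MAP exceeds 75 mmHg.

```python
import numpy as np

def label_samples(map_series, horizon=5):
    """Sketch of the index publication's sample selection (1 sample/min).

    Positive sample: the point `horizon` min before the start of a run
    (>= 1 min) with MAP < 65 mmHg -- its own MAP value is unrestricted.
    Negative sample: the center of a 30-min section with all MAP > 75 mmHg,
    at least 20 min from any hypotensive minute -- its MAP is always > 75.
    """
    map_series = np.asarray(map_series, dtype=float)
    hypo = map_series < 65
    # Start indices of hypotensive runs, shifted back by the prediction horizon.
    positives = [i - horizon for i in range(len(map_series))
                 if hypo[i] and (i == 0 or not hypo[i - 1]) and i >= horizon]
    hypo_idx = np.flatnonzero(hypo)
    negatives = []
    for start in range(len(map_series) - 30 + 1):
        window = map_series[start:start + 30]
        if not (window > 75).all():
            continue  # section must stay above 75 mmHg throughout
        if ((hypo_idx >= start - 20) & (hypo_idx < start + 50)).any():
            continue  # section must be >= 20 min from any hypotensive minute
        negatives.append(start + 15)  # center point is the negative sample
    return positives, negatives
```

Running this on any simulated MAP trace shows the asymmetry: `map_series[negatives]` can never contain a value at or below 75 mmHg, while `map_series[positives]` is unconstrained.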
As an analogy, one can imagine excluding all subjects younger than 60 yr from the samples corresponding to nonhypotensive events. Then a prediction model applied to the remaining data could simply predict “hypotension” for all subjects younger than 60 yr and be correct every time.
Given this selection problem, receiver operating characteristics curves for MAP and the Hypotension Prediction Index should both show a skew toward high specificity (as shown in fig. 1). Because MAP monitoring is commonly used to titrate blood pressure treatment, we consider that the added clinical value of Hypotension Prediction Index is the difference between the Hypotension Prediction Index’s and MAP’s ability to predict hypotension. Unfortunately, MAP’s ability to predict future hypotension was not reported in the index publication. Instead, the Hypotension Prediction Index was compared to ΔMAP (the change in MAP during, e.g., 3 min), a comparator that is both unintuitive and a poor predictor of hypotension, as discussed in a recent paper.
The likely existence of a significant selection bias in the development and validation of the Hypotension Prediction Index is the key message of this discussion.
In the following sections, we first use simulated data to visualize the effect of the biased selection, and demonstrate how the selection can result in skewed receiver operating characteristics curves similar to that in figure 1. Second, we address questions that may naturally arise: Does this bias also affect the numerous validation studies in the literature? What about all the other features used in the Hypotension Prediction Index model? How could the Hypotension Prediction Index be validated appropriately? What is the effect of using Hypotension Prediction Index in clinical trials?
Visualization of the Selection Problem
To visualize how the selection problem can artificially enhance the current MAP’s ability to predict hypotension (and thereby likely also the Hypotension Prediction Index’s ability), we performed a simple simulation (fig. 3). The simulation is not an attempt to produce realistic data but only serves to visualize how the selection problem can result in a “skewed” receiver operating characteristics curve with very high specificity.
We generated normally distributed data representing the current MAP values available for predicting future hypotension. The MAP values corresponding to hypotensive events have a lower mean than the MAP values corresponding to nonhypotensive events, making current MAP a modest predictor of hypotension (fig. 3A). We then imposed a selection that removes current MAP values less than 75 mmHg for nonhypotensive events. The receiver operating characteristics curve corresponding to this biased selection has a characteristic “skew” toward high specificity and a markedly increased AUC (fig. 3B).
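A simulation of this kind can be reproduced in a few lines. The sketch below is our own minimal version, not the simulation code behind figure 3: the distribution parameters are arbitrary illustrative choices, and the AUC is computed as the Mann–Whitney pair probability, which is equivalent to the area under the receiver operating characteristics curve.

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(neg, pos):
    """AUC for 'lower MAP predicts hypotension': the probability that a
    random event sample has lower MAP than a random nonevent sample
    (Mann-Whitney statistic; ties count one half)."""
    neg, pos = np.asarray(neg), np.asarray(pos)
    less = (pos[:, None] < neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return less + 0.5 * ties

# Current MAP 5 min before hypotensive vs. nonhypotensive events
# (arbitrary normal distributions, not fitted to real data):
map_pos = rng.normal(72, 10, 2000)  # before hypotensive events: lower mean
map_neg = rng.normal(82, 10, 2000)  # before nonhypotensive events

print(auc(map_neg, map_pos))  # modest discrimination from MAP alone

# Impose the biased selection: nonevent samples with MAP < 75 are removed.
map_neg_biased = map_neg[map_neg > 75]
print(auc(map_neg_biased, map_pos))  # markedly higher AUC, as in fig. 3B
```

The second AUC exceeds the first purely because of the truncation of the nonevent distribution; nothing about the predictor itself has changed.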
Most Hypotension Prediction Index Validation Studies Seem Biased
We are aware of eight subsequent validation studies of Hypotension Prediction Index. In these studies, the Hypotension Prediction Index was either downloaded directly from a HemoSphere or EV1000 monitor (Edwards Lifesciences, USA), or calculated post hoc from the arterial waveform using the same algorithm. Only one of these studies compares the predictive performance of the Hypotension Prediction Index to that of the concurrent MAP value.
Three studies used the same data selection as the index publication and presented similarly skewed receiver operating characteristics curves with high specificity (Wijnberge et al. also did the “forward” analysis described below and is counted there as well). Another three showed a “skewed” receiver operating characteristics curve with a very high specificity, but did not specify exactly how nonhypotensive events were selected. Two used a “forward analysis” starting with a Hypotension Prediction Index alarm and looking in the next 20 min for hypotension. Both showed a high predictive performance of Hypotension Prediction Index (e.g., a positive predictive value of 80% and a negative predictive value of 96% at a Hypotension Prediction Index threshold of 85); however, the method for selecting individual prediction–outcome pairs was not explicitly described in the methods sections. Also, the Hypotension Prediction Index’s predictive ability was not compared with the concurrent MAP value’s predictive ability in these two studies, further complicating the interpretation of these results, as it remains unclear to what extent the Hypotension Prediction Index’s predictive performance is driven by the concurrent MAP value alone. The study by Ranucci et al. selected Hypotension Prediction Index values corresponding to both hypotensive events and nonhypotensive events 5 to 7 min before the event. By using data before the nonhypotensive events, they avoided creating a data set where samples corresponding to nonevents had an artificially high MAP, thereby avoiding the selection bias described above. The authors presented a “symmetric” receiver operating characteristics curve with an AUC of 0.768 for the Hypotension Prediction Index’s ability to predict hypotension. 
While this AUC may be more realistic than what is reported in other studies, it was based on just 77 hypotensive events, and the Hypotension Prediction Index performance was not compared with that of the concurrent MAP value.
Only Davies et al. compared Hypotension Prediction Index to MAP. Therein, the Hypotension Prediction Index predicted hypotension markedly better than the concurrent MAP value (AUC 0.926 vs. 0.807, respectively), and only the Hypotension Prediction Index receiver operating characteristics curve was skewed toward high specificity (see figure 2 in Davies et al.). The methods section in the paper does not provide enough information about the data selection to explain why the Hypotension Prediction Index’s receiver operating characteristics curve indicates the presence of a selection bias while the MAP’s receiver operating characteristics curve does not.
Biased Data Make Biased Models
Using multiple features from the arterial waveform to predict hypotension is an admirable idea. However, we speculate that most of the potential added value is lost, because the selection bias forced the Hypotension Prediction Index to learn almost solely from MAP in its development: if MAP is less than 75 mmHg, the patient will be classified as hypotensive. When MAP is less than 75 mmHg, other features can only impair this “perfect” prediction. This creates a biased model that overestimates the risk of hypotension. The model will presumably overrepresent MAP and underrepresent other waveform features (at least when MAP is less than 75 mmHg). If our speculation is true, we should expect that the Hypotension Prediction Index is almost a one-to-one transformation of the concurrent MAP (where the Hypotension Prediction Index is high when MAP is low) with only a small impact from other features. The concurrent MAP and Hypotension Prediction Index values in figure 2 and the index publication’s figure 5 (not reprinted here) exemplify this one-to-one transformation.
Hypotension Prediction Index Should Be Revalidated
As the Hypotension Prediction Index algorithm may have been subject to selection bias in its development, and since most subsequent validation studies indicate a similar problem, the Hypotension Prediction Index may not predict hypotension as accurately as reported. We suggest that data from previously published Hypotension Prediction Index validation studies be reanalyzed, paying particular attention to ensure an unbiased selection of hypotensive events and nonevents. A more reasonable selection of nonevents could be similar to that by Ranucci et al. First, select all 1-min sections with MAP greater than 65 mmHg (optionally, also do an analysis requiring nonevents to have MAP greater than 75 mmHg, corresponding to the intentional “gray zone” implemented in the index publication and most validation studies; samples before the event should not be restricted). Then, exclude nonevents in the first 15 min after an event. For both events and nonevents, the predictor should be Hypotension Prediction Index or MAP 5, 10, or 15 min earlier. This design will allow a reasonable receiver operating characteristics curve analysis for comparing the Hypotension Prediction Index’s and MAP’s predictive abilities, but it will likely result in an overrepresentation of nonevents, so reporting positive and negative predictive values will not be meaningful (see the section The Case–Control Design Itself Is Problematic). In our view, it is the difference between the Hypotension Prediction Index’s and the concurrent MAP value’s predictive abilities (e.g., AUCs) that represents the added clinical value of the Hypotension Prediction Index over simply using MAP to guide blood pressure treatment. It is imperative that any comparison of prediction methods be based on the exact same data selection and outcome labeling.
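The suggested revalidation selection can be sketched as follows. This is our own illustrative implementation of the steps described above, assuming one MAP sample per minute; the function name and edge-case handling are ours. The crucial difference from the biased selection is that the predictor value taken `horizon` minutes earlier is never restricted, for events or nonevents.

```python
import numpy as np

def select_unbiased_samples(map_series, horizon=5, washout=15):
    """Sketch of the suggested revalidation selection (1 sample/min).

    Events: minutes with MAP < 65 mmHg. Nonevents: minutes with
    MAP > 65 mmHg, excluding the first `washout` minutes after an event.
    For both, the predictor (Hypotension Prediction Index or MAP) is taken
    `horizon` minutes earlier, with no restriction on its value.
    """
    map_series = np.asarray(map_series, dtype=float)
    hypo = np.flatnonzero(map_series < 65)
    labels, predictor_idx = [], []
    for t in range(horizon, len(map_series)):
        if map_series[t] < 65:
            labels.append(1)                 # event
        elif map_series[t] > 65:
            prior = hypo[hypo < t]
            if prior.size and t - prior.max() <= washout:
                continue                     # still in post-event washout
            labels.append(0)                 # nonevent
        else:
            continue                         # exactly 65 mmHg: excluded
        predictor_idx.append(t - horizon)    # predictor sampled earlier
    return labels, predictor_idx
```

The resulting labels and predictor indices can feed identical receiver operating characteristics analyses for the Hypotension Prediction Index and for concurrent MAP, ensuring the same data selection and outcome labeling for both.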
A recent paper appropriately suggested that a complex and proprietary algorithm like the Hypotension Prediction Index should be compared to a simple model that represents current clinical practice (as opposed to ΔMAP). The authors suggested a linear extrapolation from the current MAP value and the MAP value 1 min earlier (“LepMAP”). The study did not compare LepMAP and Hypotension Prediction Index directly, but when, in a secondary analysis (termed B-analysis), the authors applied a data selection to match that of the index publication, they found an AUC of 0.93 for LepMAP’s prediction of hypotension 2 min into the future, with receiver operating characteristics curves that were skewed toward high specificity. In the study’s A-analysis, which did not enforce the index publication’s selection, they compared the predictive ability of LepMAP to that of the concurrent MAP, and found that they were not statistically different, although the concurrent MAP’s AUC had higher point estimates. However, the paper does not address that the B-analysis creates a selection bias.
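As we read the LepMAP definition (our interpretation; the exact implementation is described in the cited paper), the extrapolation is a two-point linear forecast: continue the slope of the last minute of MAP forward over the prediction horizon.

```python
def lep_map(map_now, map_1min_ago, horizon_min):
    """Linear extrapolation of MAP (our reading of 'LepMAP'): continue
    the slope of the last minute `horizon_min` minutes into the future."""
    slope_per_min = map_now - map_1min_ago
    return map_now + slope_per_min * horizon_min

# e.g., MAP fell from 78 to 75 mmHg over the last minute:
print(lep_map(75, 78, 2))  # 69.0 -> extrapolated MAP 2 min ahead
```

The appeal of such a comparator is that it encodes what a clinician watching the trend already does, making it a more honest baseline than ΔMAP alone.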
The Case–Control Design Itself Is Problematic
The data selection in the index publication and most subsequent validation reports (including that by Ranucci et al. and what we describe in the section Hypotension Prediction Index Should Be Revalidated) is based on a case–control design: it begins with selection of cases (hypotensive events) and controls (nonhypotensive events). Afterward, predictors (Hypotension Prediction Index or the waveforms used to calculate Hypotension Prediction Index) are selected based on these cases and controls (e.g., 5 min before hypotensive events and the midpoints of nonhypotensive events). Hence, most Hypotension Prediction Index values we observe in a clinical setting will not be represented in the analysis, because they are neither 5 min before a hypotensive event nor the midpoint of a nonhypotensive event. Therefore, even without the described selection bias, this case–control design is known to give results that may not be valid in a clinical setting. This problem may be exacerbated by the exclusion of “gray zone” outcomes (outcome MAP between 65 and 75 mmHg). The index publication argues that while false-positive Hypotension Prediction Index alarms with outcomes in this “gray zone” are not included in the receiver operating characteristics curve analyses, this is not an important limitation, and that these false positives could even be beneficial. Conversely, a review of the Hypotension Prediction Index argues that these false-positive alarms may lead to overtreatment. The case–control design can be useful for model development, but since the proportion of hypotensive events does not represent the true probability of hypotension, it does not allow calculation of clinically meaningful positive and negative predictive values. 
Neither the positive predictive value of 12.6% at Hypotension Prediction Index greater than 85, reported by Ranucci et al., nor the 93.2% at Hypotension Prediction Index greater than 39, reported in the index publication, should be interpreted as the probability of imminent hypotension in a continuously monitored patient.
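Why case–control positive predictive values do not transfer to continuous monitoring follows from Bayes’ theorem: at fixed sensitivity and specificity, the positive predictive value is driven by the prevalence of events among the analyzed samples. The numbers below are illustrative only and are not taken from any of the cited studies.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' theorem."""
    true_alarms = sensitivity * prevalence
    false_alarms = (1 - specificity) * (1 - prevalence)
    return true_alarms / (true_alarms + false_alarms)

sens, spec = 0.80, 0.90  # hypothetical, held fixed in both scenarios

# Case-control data set with events and nonevents roughly balanced:
print(round(ppv(sens, spec, 0.50), 2))  # 0.89
# Continuous monitoring where, say, 5% of samples precede hypotension:
print(round(ppv(sens, spec, 0.05), 2))  # 0.3
```

The same alarm threshold thus yields very different positive predictive values depending on the event proportion, which in a case–control design is an artifact of the sampling rather than a property of the patient population.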
The process of representing clinical data as outcomes and predictors for a risk prediction problem has recently been introduced as the framing of the problem. A clinically relevant framing for investigating the Hypotension Prediction Index’s ability to predict hypotension could be as follows: When the Hypotension Prediction Index gives an alarm, how often does hypotension actually occur, e.g., 3 to 10 min later (i.e., positive predictive value); and when hypotension occurs, was there a Hypotension Prediction Index alarm 3 to 10 min earlier (i.e., sensitivity)? This way, all false-positive predictions will be counted—not just the ones corresponding to selected “nonevents.” We acknowledge that it is not trivial to decide what constitutes a single prediction, and we do not contend that the suggested framing is the single best solution. In the end, it is the model developers’ responsibility to make sure that an algorithm is developed and validated to truly address the targeted clinical problem.
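The forward framing described above can be sketched in code. This is our illustrative counting scheme, in which every alarm minute counts as one prediction; as noted, other definitions of a single prediction are defensible, and the window bounds are the 3-to-10-min example from the text.

```python
import numpy as np

def forward_framing(alarms, hypo, lo=3, hi=10):
    """Forward framing sketch (1 sample/min; boolean arrays).

    Positive predictive value: fraction of alarm minutes followed by
    hypotension `lo` to `hi` min later. Sensitivity: fraction of
    hypotensive minutes preceded by an alarm `lo` to `hi` min earlier.
    """
    alarms = np.asarray(alarms, dtype=bool)
    hypo = np.asarray(hypo, dtype=bool)
    n = len(alarms)
    # Each alarm: did hypotension occur in the lookahead window?
    alarm_hits = [hypo[t + lo: min(t + hi + 1, n)].any()
                  for t in np.flatnonzero(alarms) if t + lo < n]
    # Each hypotensive minute: was there an alarm in the lookback window?
    event_hits = [alarms[max(t - hi, 0): t - lo + 1].any()
                  for t in np.flatnonzero(hypo) if t - lo >= 0]
    ppv = float(np.mean(alarm_hits)) if alarm_hits else float("nan")
    sens = float(np.mean(event_hits)) if event_hits else float("nan")
    return ppv, sens
```

Because the analysis starts from every alarm rather than from preselected nonevents, no false-positive alarm can be excluded by the sampling scheme.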
Effect of Hypotension Prediction Index in Clinical Trials
Four randomized controlled trials have investigated the Hypotension Prediction Index’s preventive effects on hypotension. Three of these trials, as well as an additional retrospective study, showed less hypotension in the Hypotension Prediction Index–guided group compared to a control group (the trials vary in their outcome definitions and precise management protocols, but these characteristics are beyond the scope of this analysis). One trial also investigated the effect on postoperative hypotension and found no difference in their primary outcome between the Hypotension Prediction Index–guided group and the standard care group. The fourth and largest study noted no difference between the Hypotension Prediction Index–guided group and the control group in the primary outcome variable: amount of hypotension (time-weighted average MAP less than 65 mmHg). That hypotension may be reduced is important, but these results could be due to increased clinician awareness of hemodynamics (something that might also be achieved with a MAP alarm or treatment threshold of, e.g., 75 mmHg), and do not validate the predictive ability of Hypotension Prediction Index per se. In addition, preventing hypotension may come at the expense of overtreatment as indicated in the trial by Tsoumpa et al., where time spent in hypertension was increased in the Hypotension Prediction Index–guided group.
A selection bias in the development of the Hypotension Prediction Index may explain a relevant proportion of its reported predictive ability. The same bias seems present in most of the subsequent validation studies. In light of this, we are not yet convinced that using the Hypotension Prediction Index to predict hypotension is meaningfully better than using the concurrent value of MAP. We suggest that data from validation studies be reanalyzed considering the potential for this selection bias, and that the predictive performance of the Hypotension Prediction Index be compared with the predictive performance of the concurrent MAP value.