In 1997, John C. Bailar III published an editorial in the New England Journal of Medicine entitled “The Promise and Problems of Meta-Analysis.”1 In that paper, Bailar acknowledged that meta-analysis held great potential for revolutionizing the field of medicine, but he also cautioned that meta-analyses were likely to be performed poorly, which would have serious implications. In particular, he was concerned that researchers would focus on the pooled mean effect size for a set of studies and ignore the fact that the effect of a treatment might vary across populations.
At the time, I thought Bailar was being overly pessimistic. However, over the 25 years since his editorial was published, I have come to realize that he was correct to be concerned. The majority of published meta-analyses do make the mistakes he warned about. Indeed, some of these mistakes have even been codified and recommended in guidelines.
For that reason, I am pleased to see the two papers2,3 in this issue of Anesthesia & Analgesia, copublished in Regional Anesthesia & Pain Medicine, which provide a roadmap for correctly performing a systematic review and meta-analysis. They should serve as an excellent resource for those who plan to conduct a meta-analysis and for those who need to evaluate a meta-analysis conducted and published by others. For clarity, I’ll refer to these papers as Part 1 and Part 2, respectively.
The papers include sections that address the importance of specifying the relevant populations, interventions, comparisons, and outcomes. They include sections that detail the structure of a literature search. They include sections that deal with statistical models. And they include sections that address heterogeneity.
However, to fully understand the materials in these papers one needs to approach them with the proper framework. Specifically, it is imperative to dismiss several myths that permeate the literature.4,5 I therefore address those here.
MYTH # 1: THE PRESENCE OF HETEROGENEITY DIMINISHES THE UTILITY OF META-ANALYSIS
It is widely believed that the presence of heterogeneity diminishes the utility of a meta-analysis. However, the truth is more nuanced. To understand why, it helps to think of meta-analyses as falling into one of two categories.
In one type of meta-analysis, we define the studies of interest very narrowly. For example, we may limit the review to studies that enrolled a very specific population, used a specific variant of the intervention, and assessed outcomes in very similar ways. In this case, we expect the effect size to be essentially the same in all studies, and our goal is to identify that common effect size. If it turns out that the effect size varies substantially across studies, we would be able neither to report a common effect size nor to explain why the effect size varies. In this kind of analysis, the idea that heterogeneity diminishes the utility of meta-analysis would indeed be correct.
However, many currently published meta-analyses fall into a second category. In this kind of analysis, we define the studies of interest more broadly. For example, we may elect to include studies in which the populations and/or the intervention, comparator, or way of assessing outcome varies. In this case, we generally would not assume that the impact of treatment is consistent across studies. Rather, we would expect that the impact would vary. Critically, the goal of the analysis is no longer to estimate the common effect size but rather to estimate the distribution of effects. In this case, the heterogeneity is not a problem. Rather, the heterogeneity itself is the information we sought to obtain. For example, if the effect varies from clinically moderate to clinically large, we know that the treatment is useful in all cases. By contrast, if the effect varies from clinically trivial in some cases to clinically large in others, we know that we need to identify where it works and where it does not.5
MYTH # 2: THE I-SQUARED STATISTIC (I2) TELLS US HOW MUCH THE EFFECT SIZE VARIES
It is widely believed that the I2 statistic tells us how much the effect size varies across studies. However, this belief is incorrect.
When we ask about heterogeneity we are asking: “Over what interval is the effect size expected to fall, and what are the clinical implications of this dispersion?” If an intervention has a moderate clinical impact on average, we also need to know if the impact is moderate for all relevant populations, or if it varies from major in some to trivial (or even harmful) in others. If the risk of death after mitral valve surgery in octogenarians is 20% on average, we also need to know if the mortality risk is close to 20% for all relevant populations or varies from near 0% in some to as high as 60% in others. This is what we have in mind when we ask about heterogeneity.
Ironically, while researchers recognize that it is important to take heterogeneity into account when considering the potential effect of an intervention, the way heterogeneity is reported often makes this goal impossible. The majority of meta-analyses use the I2 statistic to quantify heterogeneity. While this practice is ubiquitous in many fields and is recommended in some guidelines, it is nevertheless a mistake. The I2 statistic does not tell us how much the effect size varies. It was not intended for that purpose and cannot provide that information (except when I2 is 0%).
If that seems surprising, consider the analysis by Hussain et al6 (discussed in Part 2) that compared the impact of erector spinae plane block (ESPB) versus parenteral analgesia as the control, after breast-cancer surgery. The primary outcome was cumulative oral morphine equivalent consumption at 24 hours. The researchers suggested a priori that a difference of 30 mg when compared with the control group would be clinically important. The mean difference turned out to be only 17.6 mg, which did not meet this threshold.
However, to understand the potential utility of this treatment we need to also consider the dispersion in effects. Is it the case that the impact consistently falls within 10 mg of the mean: roughly 30 mg in favor of ESPB at one extreme to 10 mg in favor of ESPB at the other? Or is it the case that the impact sometimes falls as much as 50 mg from the mean: 70 mg in favor of ESPB at one extreme, to 30 mg in favor of control at the other? In the first scenario, we would conclude that ESPB holds little promise. In the second scenario, we might conclude that ESPB is very effective in some cases but harmful in others, and we need to figure out where this treatment works and where it does not. So, which of these is true?
Hussain et al6 reported that I2 is 97%. On that basis, are we dealing with the first scenario or the second? The answer is, there is no way to know. The I2 statistic simply does not provide this information. As explained in Part 2, the I2 statistic is a proportion, not an absolute amount. It tells us what proportion of the variance we see in the forest plot reflects variance in true effects as opposed to sampling error. It does not tell us how much the effects vary on an absolute scale.
Imagine a group of clinicians discussing the potential utility of ESPB based on the fact that I2 is 97%. Some might assume that the effects fall as much as 50 mg from the mean; some might assume that the effects fall as much as 30 mg from the mean; and some might assume that the effects fall within 10 mg of the mean. The clinicians would be trying to reach a consensus, but each of them would be working with a different understanding of the facts. And it is not their fault. Given that I2 is 97%, any of these distributions is possible.
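The point can be illustrated with a short calculation. The following is a minimal sketch, not a reanalysis of the Hussain et al data: it assumes the simplified relationship I2 = T2 / (T2 + V), where T is the standard deviation of true effects and V is a "typical" within-study sampling variance. The three standard errors are hypothetical values chosen only for illustration, the negative sign for effects favoring ESPB is an assumed convention, and the function names are invented for this example.

```python
import math

def tau_from_i2(i2, typical_within_var):
    """SD of true effects implied by I2 and a typical within-study variance.

    Rearranges the simplified relationship I2 = tau^2 / (tau^2 + V).
    Requires 0 <= i2 < 1.
    """
    tau2 = i2 * typical_within_var / (1.0 - i2)
    return math.sqrt(tau2)

def approx_prediction_interval(mean, tau, multiplier=2.0):
    """Approximate 95% prediction interval as mean +/- 2 SDs of true effects."""
    return (mean - multiplier * tau, mean + multiplier * tau)

MEAN_DIFF = -17.6  # mg, from the editorial; negative = favors ESPB (assumed sign)
I2 = 0.97          # from the editorial

# Hypothetical typical standard errors (mg) for the individual studies.
for typical_se in (0.9, 2.2, 4.4):
    tau = tau_from_i2(I2, typical_se ** 2)
    low, high = approx_prediction_interval(MEAN_DIFF, tau)
    print(f"typical SE {typical_se:>4} mg -> tau {tau:5.1f} mg, "
          f"interval {low:6.1f} to {high:5.1f} mg")
```

Under these assumptions, the same I2 of 97% corresponds to true effects that fall within roughly 10 mg, 25 mg, or 50 mg of the mean, depending only on how precise the individual studies happen to be.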
So, how widely does the effect size vary? As explained in Part 2, the statistic that does provide this information is the prediction interval, which (if the analysis includes a sufficient number of studies) may be estimated as the mean effect ± 2 standard deviations of the true effects.7 In this analysis the prediction interval is estimated as −43 to +7 mg. This tells us that in 95% of cases, the true impact is expected to fall between 43 mg in favor of ESPB at one extreme, and 7 mg in favor of control at the other. Thus, there are relatively few cases where the effect meets the criterion for being clinically useful (30 mg), and a small number where it might be harmful.
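For readers who want to see how such an interval might be obtained from study-level data, here is a minimal sketch. It assumes the DerSimonian-Laird (method of moments) estimator for the variance of true effects, which is one common choice and not necessarily the one used in the papers discussed here, and it uses the "mean ± 2 standard deviations" approximation described above. The function name and the input data are hypothetical.

```python
import math

def random_effects_summary(effects, variances):
    """Random-effects mean, tau, I2, and an approximate prediction interval.

    effects:   observed effect size for each study
    variances: within-study sampling variance for each study
    Uses the DerSimonian-Laird estimate of tau^2 (an assumption of this sketch).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance weights
    sw = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Q: weighted sum of squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)  # variance of true effects
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance
    w_re = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    tau = math.sqrt(tau2)
    return {
        "mean": mean,
        "tau": tau,
        "i2": i2,
        "prediction_interval": (mean - 2 * tau, mean + 2 * tau),
    }

# Hypothetical mean differences (mg) and sampling variances for five studies.
summary = random_effects_summary(
    effects=[-35.0, -22.0, -18.0, -10.0, -2.0],
    variances=[4.0, 5.0, 3.0, 6.0, 4.0],
)
print(summary)
```

A more formal version of the interval also accounts for the uncertainty in the estimated mean and uses a t-distribution, which matters when the number of studies is small. The simple form shown here is adequate only when, as noted above, the analysis includes a sufficient number of studies.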
Clinicians might still have different opinions about whether this treatment should be pursued. That discussion is necessary and welcome. However, when we are working with the prediction interval, the discussion would be based on the actual results.8,9
MYTH # 3: WE SHOULD CLASSIFY HETEROGENEITY AS BEING LOW, MODERATE, OR HIGH BASED ON I2
There is a common practice of using the I2 statistic to classify heterogeneity as being low, moderate, or high based on cutoffs such as 25%, 50%, and 75%. This practice should be abandoned.
As discussed above, the I2 statistic does not tell us how much the effect size varies. Therefore, any classifications based on I2 cannot tell us how much the effect size varies. In the ESPB example where I2 was 97%, this approach would classify the heterogeneity as “high,” when in fact the clinical impact of the treatment was mostly consistent across studies.
By contrast, the prediction interval does tell us how much the effect size varies. Critically, it reports this on the same scale as the effect size itself—in units that are clinically meaningful. The statement: “In some 95% of cases the true impact is expected to fall between 43 mg in favor of ESPB (at one extreme) and 7 mg in favor of control (at the other)” tells us what we need to know in language that is clear and concise.10 The prediction interval is the statistic that addresses the question we have in mind when we ask about heterogeneity, and that researchers often (incorrectly) believe is addressed by I2.
A FRAMEWORK FOR READING THE TWO HIGHLIGHTED PAPERS
Once we abandon these myths, we can recommend a framework for thinking about the two papers highlighted herein.
In some settings, our goal will be to assess the impact of a specific intervention, in a specific population, under a specific set of circumstances. In that case, we would define the kinds of studies very narrowly as discussed in the section on PICO (population, intervention, control, and outcomes) in Part 1. We would expect that any variation in the observed effects would be due primarily to random sampling error, and that the true effect size, from a clinical perspective, is essentially the same in all studies. As discussed in Part 2, the way we define “essentially the same” is not based on the I2 statistic. Rather, it is based on a prediction interval. If the effect size does turn out to be essentially the same in all studies, we would be able to report this (common) effect size with good precision. However, the results would only apply to this population and this variant of the intervention.
In other settings, our goal will be to assess the impact of an intervention across populations, across variations of the intervention (eg, dose, mode of administration), across circumstances (eg, type of surgery), and so on, as discussed in Part 1. In this setting, we will likely expect the effect size to vary across studies and our goal would be to describe the distribution of effects. Again, as discussed in Part 2, the way to do this is to report the prediction interval and then consider the clinical implications.
When there is clinically important heterogeneity, the final step would be to identify moderators, such as patient type or dose, which are associated with that heterogeneity. If we have coded the data for these moderators, it may be possible to identify those that are related to the effect size. However, as explained in Part 2, these relationships or associations are observational and not causal.
In summary, the two papers provide an excellent resource for planning and performing a systematic review and meta-analysis. However, to fully understand these papers, readers need to approach them with an open mind, and to recognize that much of what they have been taught about meta-analysis may be incorrect. In this editorial, I elected to focus on myths related to heterogeneity, but there are also myths related to publication bias, statistical models, subgroup analyses, and other aspects of meta-analysis. I address those in a PDF4 that can be downloaded for free at https://www.Meta-Analysis.com/anesthesia
5. Higgins JP. Commentary: heterogeneity in meta-analysis should be expected and appropriately quantified. Int J Epidemiol. 2008;37:1158–1160.