P Valueless? Statisticians Bidding Adieu to ‘Statistical Significance’

Author: Alison McCook

Anesthesiology News

Busy clinicians who want to stay on top of the literature often only have time to scan abstracts in their favorite journals. More often than not, authors will note whether the data they are reporting are statistically significant—meaning, after performing a hypothesis test of the data, the calculated P value is less than 0.05. Even if readers don’t know what a P value represents, or how to calculate it themselves, years of reading scientific literature have drilled one concept into their minds: Results with P values less than 0.05 are “statistically significant”; data with P values higher than 0.05 are not.

Statisticians around the world would like to change that.

In a spate of recent editorials, statistics experts are calling for researchers to move away from using P<0.05 as an arbitrary threshold of data validity, and abandon the emphasis on statistical significance. The reason, they say, is that those tools have become an all-or-nothing lens through which people view data, so that only results with P values below this arbitrary cutoff are seen as worth reporting.

“Researchers tend to think about P values in a dichotomous way: ‘P=0.05 is golden, but P=0.051 is out of consideration.’ That’s bad,” said Nicole Lazar, PhD, MS, one of the authors of an editorial introducing a special issue of the American Statistician dedicated to this topic. The reason: Statistically significant data are not always clinically meaningful. “We’re trying to shift the focus back to clinical meaning.”

The limitations of and concerns around P values are “nothing new to us,” said Brennan M.R. Spiegel, MD, a co-editor of the American Journal of Gastroenterology. P values aren’t “evil,” Dr. Spiegel said in an interview. They’re just “one piece of the jigsaw puzzle” and should be considered “part of a menu of metrics that we apply when we evaluate the significance of a paper.”

The Problem

On March 20, Dr. Lazar, a professor of statistics at the University of Georgia, in Athens, and her colleagues published more than 40 articles as part of a special issue of the American Statistician. Each paper decried the overreliance on P values when determining the significance of data. That same day, a comment in Nature argued the same thing, and included support from more than 800 signatories. In the comment, the authors present two studies that both found anti-inflammatories increase the risk for atrial fibrillation by 20%. But only one had a P value of less than 0.05. The other paper reported the 20% increase as nonsignificant and concluded that there was no higher risk for atrial fibrillation when taking the drugs, and that the results of the two studies contradicted each other. The Nature authors called the situation “ludicrous.”
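To see how one and the same 20% increase can fall on either side of the 0.05 line, consider a minimal sketch of a Wald test on the log risk ratio. The two standard errors below are made-up values chosen only for illustration, not figures taken from the studies, and the function name `risk_ratio_summary` is hypothetical.

```python
import math
from statistics import NormalDist

def risk_ratio_summary(rr, se_log_rr, level=0.95):
    """Two-sided Wald P value and confidence interval for a risk ratio.

    rr         -- estimated risk ratio (1.2 means a 20% increase)
    se_log_rr  -- standard error of log(rr); smaller means a more precise study
    """
    log_rr = math.log(rr)
    z = log_rr / se_log_rr
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(log_rr - crit * se_log_rr)
    upper = math.exp(log_rr + crit * se_log_rr)
    return p_value, lower, upper

# Same estimated effect, different precision (illustrative standard errors).
precise = risk_ratio_summary(1.2, se_log_rr=0.05)
imprecise = risk_ratio_summary(1.2, se_log_rr=0.11)

for label, (p, lo, hi) in (("precise", precise), ("imprecise", imprecise)):
    print(f"{label}: P={p:.4f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Under these assumed inputs, the more precise study comes in under 0.05 and the less precise one does not, even though both estimate the same effect. The wider interval still covers a 20% increase, so it does not show that the risk is absent.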

It’s “nice to have the reminder” to include other metrics that might better capture clinical relevance, said Susan Hutfless, PhD, SM, a statistics associate editor at Gut and an associate professor at the Johns Hopkins University School of Medicine, in Baltimore. “There will be times to use statistical significance testing alongside clinical/logical interpretation. I agree that saying that something is statistically significant or not without consideration of the direction and magnitude of the relationship and measure of spread [confidence intervals] should be avoided.”

The Solution

So how should busy clinicians evaluate a paper, if they don’t rely on P<0.05? Ideally, readers won’t simply scan a paper that might have an influence on their practice, said Ruben Hernaez, MD, MPH, PhD, another statistics associate editor at Gut. “Most articles now include summary boxes to provide the ‘takeaway’ items—often without any stats at all—to assist our busy lifestyles. Much can be learned from reading the full text for study design, confounder control and so many other factors that come into play besides the statistical significance.”

The most important factor, said Dr. Hernaez, is whether the findings appear consistent over several studies. “We think that ‘magic’ of 0.05 threshold might be tested with what we call validation studies, meaning that the study is done elsewhere and obtain consistent results,” he said.

If clinicians really only have time to consider one metric in each paper, Dr. Spiegel recommended looking at the number needed to treat (NNT), which shows how many patients have to be treated with the drug in question for one additional patient to benefit. The NNT could be statistically significant or not, he said; but if it’s 20 (meaning doctors have to treat 20 people with one drug instead of another for one additional patient to benefit), that’s too many, he said, even if the P value is less than 0.05. “If a paper is looking at a new drug, technique or intervention, [NNT] is a clinically useful metric.”
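The arithmetic behind the NNT is short: it is the reciprocal of the absolute risk reduction, conventionally rounded up to a whole patient. The sketch below assumes the outcome being counted is an unwanted event (so fewer events in the treated group means benefit); the function name and the trial counts are hypothetical.

```python
import math
from fractions import Fraction

def number_needed_to_treat(control_events, control_n, treated_events, treated_n):
    """NNT = 1 / absolute risk reduction, rounded up to a whole patient.

    Exact fractions are used so that a clean result such as 20 is not
    nudged to 21 by floating-point rounding.
    """
    control_rate = Fraction(control_events, control_n)
    treated_rate = Fraction(treated_events, treated_n)
    risk_reduction = control_rate - treated_rate
    if risk_reduction <= 0:
        raise ValueError("treatment shows no absolute risk reduction; NNT is undefined")
    return math.ceil(1 / risk_reduction)

# Hypothetical trial: 30 of 100 control patients have the event vs. 25 of 100 treated.
print(number_needed_to_treat(30, 100, 25, 100))
```

With these made-up counts the absolute risk reduction is 5 percentage points, which corresponds to the NNT of 20 used as an example above.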

Other metrics Dr. Spiegel considers when evaluating the significance of a paper include clinical effect size (often measured by Cohen’s d statistic) and 95% confidence intervals. Some types of judgment can’t be quantified: Dr. Spiegel said he applies an “eyeball” test to see if the results appear clinically meaningful, and sometimes asks reviewers to pose the “so what” question when evaluating a paper. “That is answered through a combination of statistical significance and clinical relevance.”
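For readers who want to see what those two metrics involve, here is a rough sketch. Cohen's d is the difference between two group means divided by their pooled standard deviation. The confidence interval below uses a normal approximation, which is an assumption made to keep the example within Python's standard library; a t-based interval would be wider and more appropriate for samples this small. The scores and function names are invented for illustration.

```python
import math
from statistics import NormalDist, mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def mean_difference_ci(a, b, level=0.95):
    """Normal-approximation confidence interval for the difference in means."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - crit * se, diff + crit * se

# Invented outcome scores for two small groups.
treated = [4, 5, 6, 7, 8]
control = [2, 3, 4, 5, 6]

d = cohens_d(treated, control)
ci_low, ci_high = mean_difference_ci(treated, control)
print(f"Cohen's d = {d:.2f}; 95% CI for the mean difference: {ci_low:.2f} to {ci_high:.2f}")
```

The effect size says how large the difference is relative to the spread of the data, while the interval shows the range of differences compatible with it. Neither reduces to a yes-or-no verdict the way a P value threshold does.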

Dr. Spiegel said he doesn’t expect the statistics present in submissions to change overnight, nor should they. If authors submitted a manuscript devoid of P values, he would not reject it solely for that, he said. But if other statistics are included and P values are left out, he might question that decision, since he believes P values are one of many tools readers should use to evaluate a paper. “I do look forward to removing the emphasis on the P value as the singular importance,” Dr. Spiegel said. “But to just discard it completely, I think would be a mistake.”
