Study authors are increasingly reporting P values in their abstracts, but the statistic is not the best measure of success or failure and is too often misleading, according to an analysis of more than 12 million abstracts.
For physicians, this is more than a theoretical discussion of methodology, as P values are used in decisions about whether treatments work and whether they have harms. "Consequences can be really grave," said senior author John Ioannidis, MD, DSc, professor of disease prevention and of health research and policy and codirector of the Meta-Research Innovation Center at Stanford University in California.
In fact, the findings may have wide implications for the overall validity of biomedical literature, say David Chavalarias, PhD, from the Complex Systems Institute of Paris Île-de-France, and colleagues in an article published in the March issue of JAMA. The researchers electronically mined MEDLINE and PubMed Central for the appearance of P values and manually reviewed 1000 abstracts and 100 full papers.
They found that the proportion of abstracts reporting P values had more than doubled, from 7.3% in 1990 to 15.6% in 2014. In 2014, P values were reported in 33.0% of abstracts from the 151 core clinical journals (n = 29,725 abstracts) and 54.8% of randomized controlled trials.
Furthermore, almost all (96%) of the abstracts with P values had at least one that was "statistically significant," they report.
"The fact that you have so many significant results is completely unrealistic," Dr Ioannidis said in a statement.
P values have long been debated, with some scientists saying they should be eliminated altogether.
Dr Ioannidis told Medscape Medical News that he is not calling for their elimination but, rather, for their proper use, along with indicators such as effect sizes, which include odds ratios and risk differences, and confidence intervals, which indicate the degree of certainty about the results.
He also advocates use of false-discovery rates or Bayes factor calculations, which estimate how likely a result is to be true or false.
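As a rough illustration of the false-discovery idea, the sketch below estimates what share of "statistically significant" findings would be false positives under stated assumptions. The function name and the input values (prior probability that a tested effect is real, study power, significance threshold) are hypothetical and chosen only for illustration; this is a textbook-style calculation, not necessarily the specific method Dr Ioannidis has in mind.

```python
def false_discovery_rate(prior_true, power, alpha=0.05):
    """Share of significant results that are false positives.

    prior_true: assumed fraction of tested hypotheses that are really true
    power:      assumed probability of detecting a true effect
    alpha:      significance threshold (probability of a false positive
                when there is no effect)
    """
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# Hypothetical scenario: 1 in 10 tested effects is real, power is 80%.
fdr = false_discovery_rate(prior_true=0.10, power=0.80, alpha=0.05)
print(f"Estimated false-discovery rate: {fdr:.0%}")
```

Under these assumed inputs, more than a third of the results that pass the .05 threshold would be false, which is the kind of information a P value alone does not convey.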
The authors found that of the 796 manually reviewed abstracts that contained empirical data, only 111 (13.9%) reported effect sizes, and only 18 (2.3%) reported confidence intervals. Fewer than 2% reported both an effect size and a confidence interval. None reported Bayes factors or false-discovery rates.
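To make the terms concrete, here is a minimal sketch of how an effect size and its confidence interval might be computed for a two-arm study. The event counts are invented for illustration, the function names are hypothetical, and the intervals use standard large-sample (Wald-type) approximations rather than any method described by the study authors.

```python
import math


def risk_difference_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Risk difference (arm A minus arm B) with an approximate 95% CI."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


def odds_ratio_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Odds ratio (arm A vs arm B) with an approximate 95% CI on the log scale."""
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical trial: 30 of 100 events with treatment, 45 of 100 with control.
rd, rd_lo, rd_hi = risk_difference_ci(30, 100, 45, 100)
odds, or_lo, or_hi = odds_ratio_ci(30, 100, 45, 100)
print(f"Risk difference: {rd:.3f} (95% CI {rd_lo:.3f} to {rd_hi:.3f})")
print(f"Odds ratio: {odds:.3f} (95% CI {or_lo:.3f} to {or_hi:.3f})")
```

Unlike a bare P value, these outputs show both how large the estimated effect is and how much uncertainty surrounds it.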
Dr Ioannidis and a team of physicians and scientists helped draft a statement from the American Statistical Association, published online March 7 in the American Statistician, regarding P values, which called for researchers to employ "a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean." The authors add, "No single index should substitute for scientific reasoning."
The statement is accompanied by multiple essays on ways to change reliance on P values.
Researchers have relied on P values because they are fairly easy to derive with automated software, Dr Ioannidis told Medscape Medical News. "You don't need any substantial training, and it's very easy to pass the threshold of the test."
He continued, "It's so tightly linked to increasing your chances of having the paper accepted and allowing you to claim success and allowing you then to ask for more money for funding. I think that's probably the key motive for why people use it."
Changing that pattern must come from better training of researchers and changing policy among gatekeepers, including journal editors, in demanding better indicators of whether an exposure or treatment works, he said.
Another option is "that we try to make sure that for all research, each subject matter expert or scientist will have to team with an experienced professional statistician and/or methodologist and make sure that the statistical component is up to par and P values are not misused," he said.
P values are a measure of statistical significance, with the threshold for significance typically set at less than .05, and are intended to help readers interpret scientific conclusions. What is widely misunderstood, the authors write, is that P values cannot tell you whether a result is true, or how likely it is that something has no effect.
In a previous essay on the problems with the P value, Steven Goodman, MD, MHS, PhD, professor of medicine (general medical disciplines) and of health research and policy (epidemiology), noted that it is hardly the researchers' fault that the statistic is hard to interpret, as the statistician and biologist R. A. Fisher, who suggested it could be used, "never could describe straightforwardly what it meant from an inferential standpoint."
Despite that difficulty, Dr Goodman does provide a relatively straightforward description of what a P value does mean. "The operational meaning of a P value less than .05 was merely that one should repeat the experiment. If subsequent studies also yielded significant P values, one could conclude that the observed effects were unlikely to be the result of chance alone. So 'significance' is merely that: worthy of attention in the form of meriting more experimentation, but not proof in itself."
Problem Is Not the P Value Itself
Demetrios N. Kyriacou, MD, PhD, a senior editor of JAMA and professor in the Department of Emergency Medicine at Northwestern University Feinberg School of Medicine in Chicago, Illinois, wrote an accompanying editorial. He told Medscape Medical News he agrees with the authors that the problem is not with the P value itself, but with the misunderstanding of it and the rigidity of the .05 threshold.
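The rigidity of a fixed .05 cutoff can be illustrated with a small sketch. It assumes a two-sided test on a standard normal (z) statistic, and the two z values are hypothetical; the function name is likewise invented for illustration.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided P value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


# Two hypothetical studies with almost identical test statistics.
p_low = two_sided_p_from_z(1.97)
p_high = two_sided_p_from_z(1.95)
print(f"z = 1.97 -> P = {p_low:.3f}")
print(f"z = 1.95 -> P = {p_high:.3f}")
```

The evidence in the two hypothetical studies is nearly the same, yet one falls just below the threshold and the other just above it.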
"If you had a cancer study that compared a new drug to an old treatment and found a P value of .051, that's not much different than a P value of .049. But to determine something is significant is .049, that's really not scientific. It's rather arbitrary," he said.
He gave another example using an imaginary hypothesis that coffee causes lung cancer. Such a study 50 years ago might have shown an impressive P value, he explained, but a scientist would need more information to determine that the cause of the lung cancer was not the coffee drinking but the smoking that can accompany it.
"The P value gives you a sense of whether there's a relationship, but doesn't tell you whether it is causal or not," he said. "P value gives you mathematical measure of the relationship, but it doesn't tell you if the relationship has scientific validity."
He said the study by Dr Chavalarias and colleagues pointed out a lack of progress in improving scientific literature.
"This study really highlights how few articles are including confidence intervals, which are really more important than P values," he said. "That has been the recommendation of statisticians and epidemiologists for many years, but I think clinicians doing research or investigators feel they have reached statistical significance and that's the most important part."
He said the greatest basic danger of incorrect use of the P value is that "it could lead to a belief that some finding is true when it's not or not true when it is."
The Meta-Research Innovation Center at Stanford is supported by the Laura and John Arnold Foundation. The work of Dr Chavalarias is supported by the Complex Systems Institutes of Paris Île-de-France, the Région Île-de-France, and a grant from the CNRS Mastodons program. Coauthors were supported by the Canadian Institute for Health Research with a Michael Smith Foreign Study Supplement and a gift from Sue and Bob O'Donnell to Stanford Prevention Research Center. The authors and Dr Kyriacou have disclosed no other relevant financial relationships.