Forcing a Deterministic View on Probabilistic Phenomena: Implications for the Replication Crisis
By Carol Ting
An often overlooked source of the “replication crisis” is the tendency to treat the replication study as a definitive verdict while ignoring the statistical uncertainty inherent in both the original and replication studies. This simplistic view fosters misleading dichotomies and erodes public trust in science. By analyzing the media coverage of the Open Science Collaboration’s “Reproducibility Project: Psychology” (RPP), this commentary draws attention to this common neglect of statistical uncertainty and highlights the need to discuss replication studies in the context of cumulative evidence.
Method Uncertainty vs. Statistical Uncertainty
Method uncertainty encompasses the variability of outcomes arising from differences in research and analysis methods. Its most widely recognized source is the varying extent to which potentially confounding factors are controlled, but it can also arise from the criteria for selecting participants into a study and from the implementation of statistical procedures, such as how variables are coded.
The commentary centers on the second kind of uncertainty, statistical uncertainty, which plays a prominent role in biology and the human sciences but is unfortunately often overlooked when results are interpreted. Unlike the phenomena studied in the physical sciences, which are governed by universal and exact laws, those involving complex living organisms are characterized by fuzzy boundaries between categories, intricate feedback loops, and complex natural variation (Mayr, 1985; Nelson, 2016). This added layer of complexity, along with the sampling variation and “noise” it introduces (modeled as “statistical uncertainty”), is what renders biomedicine, psychology, and the social sciences “soft”: they explore phenomena whose regularities are amorphous and situational.
Forcing a Deterministic Frame on Continuous Probabilistic Phenomena
Unfortunately, our desire for certainty misleads us into approaching statistics as if it could eliminate all uncertainty and into treating hypothesis testing as if it could transform a continuous probabilistic phenomenon into a deterministic dichotomy (statistically significant vs. statistically insignificant; Gelman, 2016; Gelman & Loken, 2014). However, neither dichotomizing results (based on whether single p-values exceed a cutoff) nor using isolated p-values for decisions or judgments is justified. The p-value is not only a function of what would be purely random error under the test hypothesis; it also depends on all other assumptions used to compute it (Greenland, 2019). Those additional assumptions cannot be assured in the soft sciences, which typically deal with small effect sizes and with deviations from assumptions that lead to uncontrolled biases (Greenland, 2017). The result is a low ratio of true signal (effect) to random noise and bias, with the bias producing more erroneous decisions than statistical theory would indicate. The dichotomous, deterministic use of p-values can therefore be misleading, and it contributes to the “replication crisis” in two key ways.
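To see how uncontrolled bias can produce more errors than the nominal error rate suggests, consider the following minimal sketch (an illustration added here, not a calculation from the article). It assumes a simple two-sided z-test at the 5% level and expresses a hypothetical uncontrolled bias in standard-error units; both the test and the bias values are illustrative assumptions.

```python
from scipy.stats import norm

# Nominal two-sided z-test at the 5% level (cutoff |z| > 1.96).
# If an uncontrolled bias shifts the estimate by some multiple of its standard
# error, the realized false-positive rate under the null exceeds the nominal 5%.
# The bias values below are purely illustrative assumptions.
for bias_in_se_units in (0.0, 0.5, 1.0):
    false_positive_rate = (norm.sf(1.96 - bias_in_se_units)
                           + norm.cdf(-1.96 - bias_in_se_units))
    print(f"bias = {bias_in_se_units} SE -> "
          f"false-positive rate ≈ {false_positive_rate:.1%}")
```

With a bias of one standard error, roughly 17% of null studies cross the nominal 5% threshold, illustrating the gap between the error rates promised by statistical theory and those realized in practice.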
Firstly, even in the presence of a true effect, the likelihood of a pair of studies (original and replication) yielding “inconsistent” results is a function of power and is substantial at typical power levels: for power below 50%, which is typical in the soft sciences (Bakker et al., 2012; Button et al., 2013; van Zwet et al., 2024), the likelihood of apparently conflicting results can be as high as 50% even when a true effect is present. At 20% power, the rate of observing conflicting results is still 32% given a true effect. A 32% to 50% rate of apparent conflict thus grossly exaggerates disagreement when no actual conflict exists.
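The figures above follow from elementary arithmetic. Under the simplifying assumptions that both studies test a true effect with the same power and that a “conflict” means one study reaches p < 0.05 while the other does not, the conflict probability is 2 × power × (1 − power); the short sketch below (an illustration, not code from the article) reproduces the 50% and 32% figures.

```python
# Probability that exactly one of two independent studies of a true effect
# reaches statistical significance, assuming both have the same power.
def conflict_rate(power: float) -> float:
    return 2 * power * (1 - power)

for power in (0.5, 0.2):
    print(f"power = {power:.0%}: apparently conflicting results "
          f"in {conflict_rate(power):.0%} of study pairs")
```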
Secondly, this dichotomous, deterministic frame also exacerbates the “replication crisis” through journals’ selective publication of statistically significant results (publication bias). When statistical power is low, only the most extreme results clear the “statistically significant” bar (Amrhein et al., 2019; Gelman & Carlin, 2014; van Zwet et al., 2021), leaving the published literature biased toward inflated effect sizes. These inflated effect sizes in turn misguide the design of replication studies and lead to underpowered replications. Moreover, because initial results in the literature tend to be outliers, the results of replication studies tend to regress toward the mean. Together, underpowered studies and regression to the mean further increase the likelihood that initially positive findings will “fail to replicate” in subsequent studies (Maxwell et al., 2015; van Zwet et al., 2021, 2024).
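The effect-size inflation described here (sometimes called the “winner’s curse”) can be illustrated with a small simulation. The setup below is hypothetical: a one-sided z-test with an assumed true standardized effect of 0.2 and 50 observations per group; none of these numbers come from the RPP or from the article being summarized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scenario: true standardized effect d = 0.2, n = 50 per group,
# one-sided z-test at alpha = 0.05. All parameters are illustrative assumptions.
d_true, n = 0.2, 50
se = np.sqrt(2 / n)                    # standard error of the estimated effect
n_sim = 100_000

d_hat = rng.normal(d_true, se, n_sim)  # effect estimates across simulated studies
significant = d_hat / se > 1.645       # studies clearing the significance bar

print(f"power: {significant.mean():.0%}")
print(f"mean estimate, all studies: {d_hat.mean():.2f}")
print(f"mean estimate, significant ('publishable') studies: {d_hat[significant].mean():.2f}")
```

In this scenario only about a quarter of the studies reach significance, and their average estimated effect is more than twice the true effect; a replication powered on that inflated estimate would therefore be badly underpowered against the real effect.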
A re-examination of the RPP results illustrates this problem (Amrhein et al., 2019): assuming a more realistic 50% power (instead of the 92% used in the RPP), the “low replication rate” of 36% observed in the RPP would be expected even if only 30% of the original studies were false positives. This should caution us against the hasty conclusion that most or all of the original studies were false positives, and it warns us that many “replication failures” arise from the statistical variability inherent in both the replication and the original studies.
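One simple way to see how a 36% “replication rate” can coexist with far fewer than 64% false positives is the back-of-the-envelope calculation below. It assumes replications have 50% power against real effects and a 5% chance of significance when the original finding was spurious; this is an added illustration, not a reproduction of the calculation in Amrhein et al. (2019).

```python
# Expected share of replications reaching p < 0.05, assuming 50% power when the
# original effect is real and a 5% significance rate when it is not.
power, alpha = 0.50, 0.05

for false_positive_share in (0.0, 0.3, 0.5):
    expected_rate = (1 - false_positive_share) * power + false_positive_share * alpha
    print(f"{false_positive_share:.0%} of originals false -> "
          f"expected replication rate ≈ {expected_rate:.1%}")
```

With 30% of the original studies assumed to be false positives, the expected replication rate is about 36%, matching the RPP figure without implying that most original findings were spurious.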
To demonstrate this problem, the commentary includes an analysis of the media coverage of the RPP (59 non-duplicated news reports and 15 opinion pieces published in 2015, drawn from a total of 52 sources). The analysis reveals a clear pattern: while almost all reports and opinion pieces cover method uncertainty, only one report and two opinion pieces touch upon statistical uncertainty. Notably, among the ten interviewed experts who expressed reservations about the RPP outcomes, only one pointed to statistical uncertainty as an explanation for “failed” replications, further highlighting statistical uncertainty as a major blind spot in the pertinent public discussion. Additionally, the analysis documented several noteworthy instances in which reports implicitly equated “non-replication” with false positives and even insinuated fraud. Such labeling is not only premature but also undermines public trust in science, and maintaining that trust is one of the biggest challenges facing science today.
Discussion
Replication projects and follow-up efforts are a partial solution to rebuilding science’s credibility, but they cannot succeed without tackling the problem of statistical misconceptions. As demonstrated above, blind spots regarding statistical uncertainty in replication projects exaggerate irreplicability and ultimately undermine trust. To reach trustworthy conclusions, we not only need high-quality data from sound studies; we also need the patience to let studies (the data points) accumulate until they provide a clear picture of actual effects and of how those effects vary across study settings. There is no shortcut, and hastily drawing conclusions from a single replication study is just as unwise as blindly trusting the original study.
References
Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication. The American Statistician, 73(sup1), 262–270. https://doi.org/10.1080/00031305.2018.1543137
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7(6), 543–554. https://doi.org/10.1177/1745691612459060
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
Gelman, A. (2016). The problems with p-values are not just with p-values. The American Statistician, supplemental materials to ASA Statement on p-Values and Statistical Significance, 70, 1–2. http://www.stat.columbia.edu/~gelman/research/published/asa_pvalues.pdf
Gelman, A., & Carlin, J. (2014). Beyond power calculations. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642
Gelman, A., & Loken, E. (2014). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. https://statmodeling.stat.columbia.edu/2021/03/16/the-garden-of-forking-paths-why-multiple-comparisons-can-be-a-problem-even-when-there-is-no-fishing-expedition-or-p-hacking-and-the-research-hypothesis-was-posited-ahead-of-time-2/
Greenland, S. (2017). The need for cognitive science in methodology. American Journal of Epidemiology, 186, 639–645. https://academic.oup.com/aje/article/186/6/639/3886035
Greenland, S. (2019). Valid p-values behave exactly as they should: Some misleading criticisms of p-values and their resolution with S-values. The American Statistician, 73(sup1), 106–114. https://doi.org/10.1080/00031305.2018.1529625
Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70(6), 487–498. https://doi.org/10.1037/a0039400
Mayr, E. (1985). How biology differs from the physical sciences. In D. Depew & B. Weber (Eds.), Evolution at a crossroads: The new biology and the new philosophy of science (pp. 43–63). A Bradford Book.
Nelson, R. R. (2016). The sciences are different and the differences matter. Research Policy, 45(9), 1692–1701. https://doi.org/10.1016/j.respol.2015.05.014
van Zwet, E., Schwab, S., & Greenland, S. (2021). Addressing exaggeration of effects from single RCTs. Significance, 18(6), 16–21. https://doi.org/10.1111/1740-9713.01587
The Article
Ting, C., & Greenland, S. (2024). Forcing a deterministic frame on probabilistic phenomena: A communication blind spot in media coverage of the “replication crisis.” Science Communication. https://doi.org/10.1177/10755470241239947
For an open access version, please click here.
Related Work
Ting, C., & Montgomery, M. (2023). Taming human subjects: researchers’ strategies for coping with vagaries in social science experiments. Social Epistemology, 1-17. https://doi.org/10.1080/02691728.2023.2177128
About the Author
Carol Ting is an assistant professor in the Department of Communication at the University of Macau. Her research focuses on social science methodology and, in particular, on factors complicating research replication and reproducibility. Her recent work has appeared in Social Epistemology, the International Journal of Social Research Methodology, and the Social Science Journal. For more information about Carol’s work, please see: https://comm.fss.um.edu.mo/prof-carol-ting/
Summary prepared by Carol Ting, who takes sole responsibility for all errors within it.