Contained Experiments, Reproducibility Crisis?, Six Types of Metascience, and Personal Predictions
Contained Experiments
Over on Never Met a Science, Kevin Munger discusses his new preprint on “Contained and Uncontained Experiments.”
Contained experiments are those for which it is plausible to “perform the same procedure” and expect “the same result.” Uncontained experiments are those for which practitioners expect intrinsic variation to dominate; even when performing the same procedure, the goal is to understand how results differ across time and context. I propose three conceptions of containedness: pragmatic (community expectations about replication), complexity (information required to specify the procedure), and temporal (speed of experimentation relative to drift in the phenomenon).
I was particularly interested in the pragmatic conception.
The pragmatic conception depends on the extent to which the conductors of an experiment and the scientific community with whom they interact expect the experiment to be able to be replicated. If a “contained” experiment is replicated and different outcome results, the community takes this as strong evidence for a problem. If the same occurs for an “uncontained” experiment, this is taken as evidence for some important variation in the context or implementation of the two experiments.
As Munger (2026) noted, Feest (2024) has made a related distinction between “effect seekers” and “context mongers.” This distinction is important because it predicts
different flavors of solution to the replication crisis. The former are at least nominally interested in behavioralist stimulus-response patterns: these are the “effects” that they seek to produce and, then, re-produce when replicating a procedure. Their prescriptions tend to be methodological, premised on the idea that failures to re-produce these effects are caused by some combination of statistical and operational procedures. The “context mongers,” in contrast, argue that this psychology research is undertheorized and that replication failures stem from an incomplete account of all of the dimensions of complexity involved in conducting those experiments.
Munger asks “why do different scientific communities have different pragmatic conceptions of containedness?” It’s an important question that deserves more attention!
Munger, K. (2026, June 16). Contained and uncontained experiments. http://kmunger.github.io/pdfs/metascience_67.pdf
Reproducibility Crisis?
Berna Devezer, Elizabeth Campolongo, and Phillip Popovich took part in a panel discussion at the Ohio State Center for Ethics and Human Values on the question “Is There a Reproducibility Crisis in Science?” They focused in particular on ethical and practical issues.
Popovich: “Let’s face it: Science papers are stories, and the story rarely occurs in the order in which it is presented.”
Devezer, on Bem’s (2011) precognition study: “We know that is not a good study. Why run it again and again and again?”
Six Types of Metascience
Caleb Watney and Jenn Gustetic discuss “The Six Camps of Metascience” over on Macroscience.
If policymakers and agency leaders are going to change how they spend billions of dollars of taxpayer money based on metascience advice, they should have a clear picture of the landscape of metascience and its various goals, perspectives, and terms.
They identify six “camps” in metascience.
Watney, C., & Gustetic, J. (2026, June 11). The six camps of metascience. Macroscience. https://www.macroscience.org/p/the-six-camps-of-metascience
There have been other approaches to illustrating the diversity of metascience. In particular, Peterson and Panofsky (2023) located metascience at the intersection of the science of science, open science, and methodological activism.
My own, admittedly biased and incomplete, taxonomy would:
characterize the promulgation of open science as “Activist Metascience” (Peterson & Panofsky, 2021);
describe work investigating research integrity issues as “Forensic Metascience” (e.g., Heathers, 2025);
merge Watney and Gustetic’s categories of “innovation economics,” “R&D management,” and “institutional entrepreneurs” into a single category, perhaps called “Institutional Metascience”; and
acknowledge four other lesser-known but nonetheless important areas:
“Theoretical Metascience” (e.g., Devezer, 2026; Munger, 2025)
“Feminist Metascience” (e.g., Cole & Gopalakrishna, 2025; Pownall, 2026)
“Critical Metascience” (e.g., Rubin, 2026; Ulpts & Bartscherer, 2025)
“Developmental Metaresearch” (e.g., Forscher & Schmidt, 2024; Schmidt, 2024)
So, I’d end up with the following…
“10 Years Ago…”
In 2016, Mark Schaller published a paper on “the empirical benefits of conceptual rigor,” in which he argued that the replication crisis is partly due to researchers conceptualizing research hypotheses as personal predictions rather than theoretical predictions. Echoing Popper, he wrote:
Although scientists may subjectively experience hypotheses as their own personal predictions, a more formal approach to scientific inquiry treats scientific hypotheses as statements that have an independent logical status—independent of the scientists who articulate them and test them, and independent of the extent to which scientists personally believe them to be true or false.
Any truly rigorous approach to psychological science requires that scientific hypotheses cannot be equated to personal predictions; hypotheses must instead be articulated as depersonalized products of some systematic analysis, and appraised accordingly.
Schaller argued that conceiving hypotheses as personal predictions is likely to produce unrealistic expectations about replication rates. Personal predictions are invested with self-serving biases and motivated reasoning, which lead researchers to view effects as less likely to be false positives, and as larger and more generalizable, than they really are. Hence:
If scientists can deliberately avoid thinking about hypotheses as personal predictions, then they are less likely to optimistically assume hypothesized effects to be truer, bigger, and broader than they actually are.
Schaller, M. (2016). The empirical benefits of conceptual rigor: Systematic articulation of conceptual hypotheses can reduce the risk of non-replicable results (and facilitate novel discoveries too). Journal of Experimental Social Psychology, 66, 107-115. https://doi.org/10.1016/j.jesp.2015.09.006

The "contained vs. uncontained" distinction lands hard.
There's a parallel in Buddhist epistemology that might be worth a footnote. The pramana tradition (roughly: the study of valid cognition, developed by Dignaga and Dharmakirti in 5th-7th century India) was obsessed with exactly this question: what makes knowledge genuinely reliable vs. context-bound?
Their answer was unsettling. All knowledge involves universals that the mind constructs. And those constructions are always partly context-dependent. Which puts most social science in the "uncontained" camp by design, not by failure.
The "effect seeker vs. context monger" framing maps onto this well. Effect seekers want a universal that survives replication. Context mongers suspect the universal is the artifact, not the data.
I don't know if this helps resolve anything. It might just be a different vocabulary for the same puzzle. But has anyone looked at whether the pramana literature has tools the metascience debates haven't reached yet?
The pragmatic conception is the one everyone reaches for, and you make a strong case for it. But the temporal conception beside it may cut deeper, because it can dissolve the category of replication failure rather than relocate it. It allows a third reading of a divergent result, beyond method error or context: the original was true, the replication is true, and the phenomenon moved in between. Much social behavior drifts on roughly the timescale of the research. The reflexive cases are sharper, where publishing the finding is part of what shifts the behavior it described. A priming effect fades as the trick becomes known; a norm changes once it is named. There the experiment is not measuring a stable target, it is nudging it. So 'does it replicate?' can hide a tense problem, treating a time-indexed truth as timeless. Part of the reproducibility crisis may be the discovery that some findings had expiration dates we never thought to print.