The problems with selecting on significance

Author

Ryan Briggs

Published

February 25, 2026

Current research production practices in the social sciences put a premium on statistically significant results. A common gloss on this is that our goal is to discover things, so journals prioritize results that are statistically distinguishable from zero. Authors know this and so often don’t even submit null results to journals. The end result is that, between 2010 and 2024, 94% of all statistical articles in political science claimed to reject a null hypothesis.

This might seem great—we’re discovering a lot!—but I worry that this research production ecosystem has major problems. In this post I’m going to explain why and when filtering research based on statistical significance, also known as selection on significance (SOS), is a problem. I will also propose and critique one alternative and discuss a final idea that could help ameliorate these issues.

I think a useful starting point in these discussions is to say what you want out of the published literature. I don’t have a full answer, but one thing I would like is for published research coefficients to be unbiased. If a published result has an effect of 1, then I would like to (correctly) believe that if I somehow averaged across all (published and unpublished) research conducted on this topic with similar setups, I would get something like 1. Ideally, this would also be informative about future replications of the research.

Most of what I’m going to present below is logical argument, but at the end I include a simulation that shows my thinking more concretely, for people who like that kind of thing. It’s just a toy model and you should not anchor on any parameter value or result. Rather, it’s there to communicate more clearly how I’m thinking about things, so that anyone else can modify it and see what they get. Finally, none of this is new, but I hope there is value in explaining it all in plain language in one place.

Selection on Significance

Let’s think about what happens if you conduct many hypothesis tests and then filter them based on whether or not a p-value crosses a threshold like 0.05. The figure below shows a null distribution and a sampling distribution. The x-axis shows standardized effect sizes and the sampling distribution is centered at 1.96, meaning that there is a 50% chance of a draw from the sampling distribution ending up in the shaded blue rejection zone of the null. This is a test with 50% power.

The mean of our sampling distribution is 1.96. However, if we cut the sampling distribution at 1.96 then the mean of this truncated distribution (in red) is about 40% larger. SOS removes the part of the distribution closer to zero, and so it biases the average away from zero. If power is lower, then the two distributions overlap more, so the truncation, and therefore the bias, is more extreme.1 As power increases, this bias goes down.
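The “about 40% larger” figure can be checked with the closed-form mean of a truncated normal distribution. This snippet is my own illustration, not part of the simulation at the end:

```r
# Mean of N(mu, 1) truncated from below at "cut" is:
#   mu + dnorm(cut - mu) / (1 - pnorm(cut - mu))
mu  <- 1.96 # mean of the sampling distribution (50% power at the 0.05 cutoff)
cut <- 1.96 # significance threshold on the standardized scale
truncated_mean <- mu + dnorm(cut - mu) / (1 - pnorm(cut - mu))
truncated_mean / mu # about 1.41, i.e. roughly 40% larger

# At lower power the bias is worse. With 20% power, the sampling
# distribution is centered at 1.96 - qnorm(0.8):
mu_low <- 1.96 - qnorm(0.8)
(mu_low + dnorm(cut - mu_low) / (1 - pnorm(cut - mu_low))) / mu_low # about 2.25
```

With 50% power, the average significant estimate overstates the truth by about 40%; with 20% power, it more than doubles it.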

Political science and economics do seem to have low power to detect effect sizes that typically exist. Political science also has lots of selection on significance. This implies that our results will be systematically biased away from zero. This is empirically what we find.

The bias caused by SOS is an especially pernicious kind of bias because it exists as a filter above tests (or articles). This means that even if the tests themselves are unbiased, and even if one reads across literatures rather than trust a single article, someone taking our published coefficients at face value would still end up with a best guess mean whose magnitude is inflated.

Given how much effort political science and economics have spent on having unbiased tests within papers—this was the entire point of the credibility revolution!—one might expect that a lot of people will be upset about the idea that we have created a research production ecosystem that systematically biases our published literature. It at least upsets me. What can we do about this?

Selection on Precision

One alternative to SOS is selection on precision (SOP): selecting results with small standard errors or, equivalently, tight confidence intervals. Because this selection process is not mechanically related to the magnitude or sign of coefficients, it will not produce biased coefficients.2 Thus, if we select on precision then we can take coefficients at face value and meta-analysis will work.

Selecting on precision is also intuitive, as it’s selecting more informative tests. Very often when I talk about the importance of not screening out null results someone will (correctly) tell me that a noisy null is not very informative of effects. This is true, but of course the problem with the current system is that we regularly publish noisy but significant tests (which are likely extreme and “unlucky” draws). Selecting on precision applies the same rule to tests regardless of the coefficient value: pick tests with more precision.

There are at least two problems with this approach. The first is that, unlike SOS, we do not have a scale-free way to define precision. In other words, I can’t give a number like 0.05 to use for easy selection. This means that what counts as “precise enough” to be prioritized for publication will vary by research question and be up to the judgement of academics. I have no way around this. A world with SOP is one where expert judgement is used more heavily than in a world with SOS. I’ve framed this as a problem, but I expect many people will see it as an advantage.

The second issue is more clearly a problem. If we produce many tests and then filter them based on precision, then the tests that make it through the filter will not have biased coefficients but they will have downwardly biased standard errors. This means that the confidence intervals from these tests will have undercoverage (the 95% intervals published under this scheme will be expected to cover the true value less than 95% of the time). This is a direct result of selecting based on precision and is unavoidable under this system.
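The mechanism is easy to see in a small sketch of my own (separate from the simulation at the end). With normal data, a study’s estimate and its estimated standard error are independent, so keeping only the most precise studies leaves the estimates unbiased but keeps intervals that are too short:

```r
set.seed(42)
# Each "study" estimates a true mean of 0 from n observations. The estimated
# SE varies by sample, so filtering on small SEs keeps short intervals.
n_studies <- 10000
n <- 20
true_mean <- 0
est <- se <- numeric(n_studies)
for (i in 1:n_studies) {
  x <- rnorm(n, true_mean, 1)
  est[i] <- mean(x)
  se[i]  <- sd(x) / sqrt(n)
}
crit <- qt(0.975, n - 1)
covered <- abs(est - true_mean) <= crit * se
mean(covered)                  # close to the nominal 0.95 with no selection
keep <- se < quantile(se, 0.2) # SOP: keep the most precise 20% of studies
mean(est[keep])                # still unbiased (close to 0)
mean(covered[keep])            # under-covers
```

In this sketch, coverage is at the nominal level overall, but among the most precise fifth of studies the intervals cover the true mean only roughly 90% of the time.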

One thing, however, that might make this bullet easier to bite is that SOS also biases standard errors down. This is because the test statistic is a ratio of the coefficient to its standard error, so selecting on significance filters for larger coefficients and also for smaller standard errors. The relative magnitude of this bias depends on many factors, so I can’t say that one is worse than the other, but if we ultimately care about the coverage of confidence intervals then SOS has an additional demerit. Because coefficients under SOS are biased away from zero, the intervals are also biased away from zero. Thus, it is possible for standard errors to be more biased under SOP than under SOS and yet for SOS to have worse coverage than SOP. I show this in my simulation.

Registered Reports

Perhaps you want unbiased coefficients and unbiased standard errors. What can you do then?

The most attractive way to get both is with registered reports. These are essentially tight pre-registration plans that are peer reviewed and accepted for publication before the study is conducted. They do not work for all kinds of research and the planning to do them well is onerous, but they offer unbiased coefficients and unbiased standard errors.

Conclusions

The main takeaway is simple: we should not filter research based on statistical significance. With low power, SOS inflates effects and gives confidence intervals that look more certain than they are. SOP is an improvement, but it is sadly not a free lunch: its confidence intervals also under-cover, and filtering in this way may mean that the estimands we estimate are not a good fit for the population parameters we care most about. Registered reports are ideal when possible, but many forms of research do not lend themselves to the registered report format.

So one realistic takeaway is that we should probably be using registered reports when possible, and otherwise move away from SOS. SOP seems like a good (if imperfect) alternative to SOS.

The simulation

This is a basic simulation, originally made by Vincent Arel-Bundock and then vandalized by me to explore SOS and SOP. It loops over a simple two-arm experiment and then presents bias ratios for the estimate and the standard error, along with confidence interval coverage, under three selection regimes: no selection, SOS, and SOP. It’s a toy model used mostly to explore and explain my intuitions. Do not over-index on any specific number.

set.seed(123)
library(marginaleffects)

sample_size <- 100
tau <- 0.30 # true treatment effect

draw <- function(N, tau) {
  D <- rbinom(N, 1, 0.5) # two arms: 0 = control, 1 = treatment
  e <- rnorm(N, 0, 1)
  Y <- tau * D + e
  data.frame(D = D, e = e, Y = Y)
}

fit <- function(d) {
  model <- lm(Y ~ D, data = d)
  hypotheses(model)[2, ] |> as.data.frame() # row 2 is the treatment coefficient on D
}

results <- do.call(rbind, replicate(1000, fit(draw(sample_size, tau)), simplify = FALSE))

selection_sets <- list(
  `No selection` = results,
  `SOS` = subset(results, p.value < 0.05), # keep only significant estimates
  `SOP` = subset(results, conf.high - conf.low < .8) # keep only precise estimates (arbitrary CI-width cutoff)
)

summarize_selection <- function(d, full, tau) {
  coverage <- mean(d$conf.low <= tau & tau <= d$conf.high)
  c(
    `Estimate bias ratio` = mean(d$estimate) / tau, # 1 means unbiased
    `SE bias ratio` = mean(d$std.error) / mean(full$std.error), # selected SEs relative to all SEs
    `CI coverage` = coverage # nominal coverage is 0.95
  )
}

results_table <- do.call(cbind, lapply(selection_sets, summarize_selection, full = results, tau = tau))
results_table <- round(results_table, 2)
results_table <- as.data.frame(results_table)

if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::kable(results_table, caption = "Selection-induced bias ratios and CI undercoverage")
} else {
  print(results_table)
}
Selection-induced bias ratios and CI undercoverage

                      No selection    SOS    SOP
Estimate bias ratio           0.98   1.74   0.99
SE bias ratio                 1.00   0.99   0.96
CI coverage                   0.94   0.91   0.92

Footnotes

  1. If power is very low then non-negligible portions of the sampling distribution can end up in both tails of the null. In this case, most significant effects will be greatly inflated, but also some will be wrongly signed!↩︎

  2. SOP avoids SOS’s built-in truncation bias, but it could filter out estimands of interest. If high-precision studies cluster in particular settings or specifications, publication under SOP will over-represent those contexts. As a result, the published estimates may reflect where precise estimates are easiest to obtain, rather than some population parameter we may care more about.↩︎