P-Hacking and Shrinkage

Andrew Gelman is seemingly always writing about how studies overstate the estimated size and precision of effects, particularly effects that are implausible (like ESP) or that could reasonably be either positive or negative.

The problems which generate these unreliable (unreplicable) claims fall into two categories: statistical and behavioral.

The statistical problems come from the fact that there are two ways to make the sampling distribution of a statistic wider: give it less data or give it noisier data. A wider sampling distribution makes extreme values more common, which is where the behavioral problems come in. If your interest is in finding big, counterintuitive effects (counterintuitive because they aren't real), then making underpowered comparisons with noisy data is the way to go.
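
As a rough illustration of the statistical half (the function, cutoff, and settings below are my own, not part of the simulation later in the post): with a true effect of exactly 0, the probability of seeing an estimate that looks impressively large grows as the sample gets smaller or noisier.

# probability that the estimated "effect" exceeds some cutoff in absolute
# value when the true effect is 0, as a function of sample size and noise
prob_extreme <- function(n, sigma, cutoff = 0.5, reps = 1e4) {
  mean(abs(replicate(reps, mean(rnorm(n, sd = sigma)))) > cutoff)
}
set.seed(1)
prob_extreme(n = 100, sigma = 0.5)  # large, clean sample: essentially never
prob_extreme(n = 10,  sigma = 3)    # small, noisy sample: more often than not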

I am going to do a short simulation that shows how these two things interact to produce shoddy science, and how regularized estimators can help with, but not fix, the problem.

To do the simulation I am going to use batchtools, a nice package by Michel Lang that makes simulation experiments easy and replicable, even when they have to be run on a high-performance cluster. It is the successor to BatchExperiments.

devtools::install_github("mllg/batchtools")
library(batchtools)

# create a registry that stores the experiment definitions and results on disk
reg <- makeExperimentRegistry()

# a "problem" generates one simulated data set (an "instance") per job
addProblem("no_difference",
  fun = function(job, data, n, sigma) rnorm(n, sd = sigma))

First I create a "problem", which is a function that draws a sample from a Normal distribution with sample size n and standard deviation sigma. This simulates a situation in which the true population effect of an intervention is 0, but we have varying amounts of data to estimate that effect, and the measurements may be more or less noisy depending on the context.
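
To see what a single instance looks like outside of batchtools (a standalone sketch, not part of the registry):

# one simulated data set: n draws from N(0, sigma^2), so any nonzero
# estimate of the effect is pure noise
set.seed(2)
instance <- rnorm(10, sd = 2)
mean(instance)  # the raw estimate of an effect that is truly 0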

addAlgorithm("bayes", fun = function(job, data, instance) {
  1 / (var(instance) + 1) * mean(instance)
})

addAlgorithm("mle", fun = function(job, data, instance) mean(instance))

Now I am going to compare two estimators of the true difference (0 by construction): the posterior mean under a conjugate Normal prior with mean 0 and variance 1 (an informative prior), which minimizes the expected posterior risk under mean-squared-error loss, and the maximum likelihood estimator, which is simply the sample mean. Both estimators are unbiased in this case (since the prior mean is centered over the true mean), but they have very different variances, and so in any particular experiment they may give very different answers.
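
To make the contrast concrete, here are the two estimators applied to a single simulated sample, outside of batchtools (a sketch using the same plug-in shrinkage factor as the algorithm registered above):

set.seed(3)
x <- rnorm(10, sd = 3)   # small, noisy sample; the true mean is 0
mle   <- mean(x)         # the sample mean
bayes <- length(x) / (length(x) + var(x)) * mean(x)  # shrunk towards 0
c(mle = mle, bayes = bayes)  # the Bayes estimate is pulled towards the prior mean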

library(data.table)  # CJ() and the := syntax below come from data.table

# cross both algorithms with every (n, sigma) combination, 100 replications each
addExperiments(
  list("no_difference" = CJ(n = c(10, 20, 50, 100),
    sigma = c(.5, 1, 2, 3))), repls = 100L
)
submitJobs()
waitForJobs()

# collect the estimates and attach the parameters of the job that produced each one
estimates <- reduceResults(function(aggr, res) c(aggr, res))
pars <- unwrap(getJobPars())  # flatten the parameter list-columns into regular columns
results <- cbind(pars, estimate = estimates)

Now I define the parameter space for the experiment, which is simply the Cartesian product of some values for n and sigma, run the experiments, and collect and summarize the results.
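
A couple of optional sanity checks at this point (assuming the default registry created above is still in scope):

summarizeExperiments()  # number of jobs per problem/algorithm combination
getStatus()             # how many jobs are queued, running, done, or failed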

results[, variance := var(estimate), by = list(algorithm, n, sigma)]

library(ggplot2)
ggplot(results, aes(sigma, variance, color = algorithm)) + geom_line() + facet_wrap(~ n)
ggsave("../static/images/variance.png", width = 8, height = 5, units = "in")

What you can see is that the variance of the Bayesian estimator is much lower because of the informative prior, which shrinks the sample mean towards the prior mean of 0.
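
If you prefer a table to a plot, the same comparison can be laid out side by side (a sketch reusing the columns created above; dcast() here is data.table's):

variances <- unique(results[, .(algorithm, n, sigma, variance)])
# one row per (n, sigma) cell, with the ratio of the MLE's variance to the
# Bayes estimator's variance
dcast(variances, n + sigma ~ algorithm, value.var = "variance")[, ratio := mle / bayes][]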

This helps with the statistical part of the problem because it makes the extreme differences we observe with small samples and noisy data less consequential: the estimate of the mean is shrunk towards 0 by an amount that depends on the variance of the sample mean, which is a function of both the underlying variability and the sample size.
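
To see how that weighting behaves across the settings in the experiment, here is the shrinkage factor written out on its own (a sketch using the true sigma rather than the plug-in sample variance):

# weight placed on the sample mean: n / (n + sigma^2)
shrink <- function(n, sigma) n / (n + sigma^2)
shrink(n = 10,  sigma = 3)    # small, noisy sample: roughly half the estimate is shrunk away
shrink(n = 100, sigma = 0.5)  # large, clean sample: essentially the MLE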

The same issues arise with more complicated models, and Bayesian methods are of course not the only way to regularize an estimator and mitigate the problem.
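
For instance, the same estimate falls out of a purely penalized (non-Bayesian) formulation; the helper below is a sketch of my own, with lambda playing the role of the prior precision:

# minimizing sum((x - mu)^2) + lambda * mu^2 over mu gives
# sum(x) / (length(x) + lambda); with lambda set to the sample variance this
# is exactly the shrinkage estimator used above
penalized_mean <- function(x, lambda) sum(x) / (length(x) + lambda)
set.seed(4)
x <- rnorm(10, sd = 3)
all.equal(penalized_mean(x, var(x)),
          length(x) / (length(x) + var(x)) * mean(x))  # TRUE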