All of these articles have substantially changed how I think about the topics the authors address. Together they also make up nearly all of the important ideas in one of my working papers. The details are important, though! The articles are in no particular order. If you have an article like this, email it to me or link to it on Twitter.

"To Explain or to Predict," Galit Shmueli, *Statistical Science* (2010)

There are (arguably) three main types of modeling you can do with statistics (machine learning is statistics imo): explanation (causal), prediction, and description/exploration. The difference between explanation and prediction arises from the disparity between concepts and their measurements, and from the fact that the function that "best" maps the explanatory variables to the response often differs depending on the goal. Causal explanation is dominant in the social sciences, though it is mostly done inappropriately (e.g., associations are discovered/estimated by the model, while the causal story comes from theory). Prediction is common in some fields, and description/exploration is mostly done by statisticians (cf. Gelman's work). These goals imply different things about how predictors are selected for inclusion, what sorts of models should be used, how missingness should be dealt with, etc. Both predictive and descriptive/exploratory modeling should be more common in political science. There was an interesting discussion of this article on CrossValidated as well.
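A toy simulation (my own sketch, not from Shmueli's article) of the core point that the "true" explanatory model need not be the best predictive model: with a small sample, dropping a real-but-weak predictor can improve out-of-sample error, because the variance cost of estimating its coefficient exceeds the bias cost of omitting it.

```python
import random

random.seed(0)

def fit_full(x1, x2, y):
    # OLS without intercept for y ~ b1*x1 + b2*x2, solved via Cramer's rule
    s11 = sum(a*a for a in x1); s22 = sum(a*a for a in x2)
    s12 = sum(a*b for a, b in zip(x1, x2))
    s1y = sum(a*b for a, b in zip(x1, y)); s2y = sum(a*b for a, b in zip(x2, y))
    det = s11*s22 - s12*s12
    return ((s1y*s22 - s2y*s12)/det, (s11*s2y - s12*s1y)/det)

def mse(y, yhat):
    return sum((a-b)**2 for a, b in zip(y, yhat)) / len(y)

def one_rep(n_train=10, n_test=200, b2_true=0.05):
    # True DGP: y = x1 + 0.05*x2 + N(0,1); x2's effect is real but tiny
    def draw(n):
        x1 = [random.gauss(0, 1) for _ in range(n)]
        x2 = [random.gauss(0, 1) for _ in range(n)]
        y = [a + b2_true*b + random.gauss(0, 1) for a, b in zip(x1, x2)]
        return x1, x2, y
    x1, x2, y = draw(n_train)
    tx1, tx2, ty = draw(n_test)
    b1f, b2f = fit_full(x1, x2, y)  # the correctly specified ("explanatory") model
    b1r = sum(a*b for a, b in zip(x1, y)) / sum(a*a for a in x1)  # drops x2
    mse_full = mse(ty, [b1f*a + b2f*b for a, b in zip(tx1, tx2)])
    mse_red = mse(ty, [b1r*a for a in tx1])
    return mse_full, mse_red

reps = [one_rep() for _ in range(500)]
avg_full = sum(r[0] for r in reps) / len(reps)
avg_red = sum(r[1] for r in reps) / len(reps)
print(f"avg test MSE, true specification: {avg_full:.3f}")
print(f"avg test MSE, reduced model:      {avg_red:.3f}")
```

The reduced model is "wrong" as an explanation but tends to predict better here, which is exactly why variable selection for prediction and for explanation can diverge.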

"Statistical Modeling: The Two Cultures," Leo Breiman, *Statistical Science* (2001)

Data modeling is where you assume a stochastic model and then use data to infer the values of that model's parameters. Algorithmic modeling instead attempts to predict the data without an assumed stochastic model. The focus of the two "camps" (not really camps anymore imo) is inference versus prediction. The latter's importance is *hugely* under-appreciated in political science. He also discusses ways to use "black-box" models (algorithmic models that produce complex, not-directly-interpretable fits) to explore and understand the data as well (which was a major inspiration for this project). Michael Jordan's reddit AMA (link to the relevant answer) brings up the (lack of) difference between statistics and machine learning as well.
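The two cultures can be caricatured in a few lines (my own toy example, not Breiman's): a linear data model whose main output is a fitted parameter to interpret, versus a k-nearest-neighbors "algorithm" judged only by held-out prediction error. On a nonlinear DGP the assumed stochastic model misleads while the black box predicts well.

```python
import random

random.seed(1)

# Simulated nonlinear DGP: y = x^2 + noise
def draw(n, sigma=0.1):
    x = [random.uniform(-1, 1) for _ in range(n)]
    y = [a*a + random.gauss(0, sigma) for a in x]
    return x, y

xtr, ytr = draw(500)
xte, yte = draw(200)

# Data-modeling culture: assume y = a + b*x + eps, then interpret the parameters
xbar = sum(xtr)/len(xtr); ybar = sum(ytr)/len(ytr)
b = sum((a-xbar)*(c-ybar) for a, c in zip(xtr, ytr)) / sum((a-xbar)**2 for a in xtr)
a0 = ybar - b*xbar
lin_mse = sum((c - (a0 + b*a))**2 for a, c in zip(xte, yte)) / len(yte)

# Algorithmic-modeling culture: k-nearest neighbors, judged purely on prediction
def knn_predict(x0, k=5):
    nearest = sorted(zip(xtr, ytr), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

knn_mse = sum((c - knn_predict(a))**2 for a, c in zip(xte, yte)) / len(yte)

print(f"fitted slope: {b:.3f} (the 'inference' output; roughly zero, and misleading)")
print(f"linear test MSE: {lin_mse:.4f}, 5-NN test MSE: {knn_mse:.4f}")
```

The slope near zero would tempt a data modeler to conclude x doesn't matter, while the algorithmic model quietly recovers the x² relationship.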

"Statistical Inference: The Big Picture," Robert Kass, *Statistical Science* (2011)

Despite the historical frequentist/Bayesian divide, most of statistics is now more pragmatic, and which methods get applied depends more on data-analytic concerns than on adherence to one camp or the other. Whether frequentist or Bayesian, a stochastic model applied to real data connects things from the theoretical world (random variables) to things in the real world (observed data). Inference in both cases is contingent on the closeness of the mapping between the data and the stochastic model. The notion of sampling from an infinite population where that is clearly not possible (which is most of the time) is not such a big deal, since random variables exist in the theoretical world. Although not discussed here a great deal, this is related to the often-ignored fact that utilizing the frequentist thought experiment *does not* oblige you to talk about tail probabilities (Neyman-Pearson null-hypothesis significance testing). Obviously, then, thinking in this way (imagining repeated sampling from the dgp) should in no way associate you with the misuse of NHST (e.g., dumb null hypotheses, overinterpretation of tail probabilities, multiple testing problems), despite what some people say about this. If everyone in political science were Bayesian we'd still mostly have the same problems. Kass has a bit more about his views on statistical philosophy here.
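A quick sketch (my illustration, not Kass's) of frequentist thinking with no NHST in sight: the repeated-sampling thought experiment just asks whether a procedure — here a 95% interval for a normal mean with known σ — does what it advertises across replications. No null hypotheses or tail probabilities are involved.

```python
import random

random.seed(2)

# Repeated sampling from the dgp: does the 95% interval cover the truth ~95% of the time?
mu, sigma, n, reps = 3.0, 2.0, 25, 2000
covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    half = 1.96 * sigma / n**0.5  # known-sigma interval, for simplicity
    if xbar - half <= mu <= xbar + half:
        covered += 1

print(f"empirical coverage: {covered/reps:.3f}")
```

The procedure is evaluated over the theoretical world of replications, which is a separate question from whether any particular null hypothesis gets rejected.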

"Causal Inference in Statistics: An Overview," Judea Pearl, *Statistical Surveys* (2009)

"[B]ehind any causal conclusion there must be some causal assumption, untested in observational studies." Statistics alone can't get you to causality, hence Pearl's development of the structural causal model (SCM). Confounding, for example, is not a variable that is correlated with the variable of interest and the outcome, it is a *cause* of both. It has historically been the case that causal inference was treated casually (heh) in political science. I think this is changing (probably thanks to the credibility revolution in economics), such that now people talk about the "endogeneity taliban."
"Morgan and Winship's book "Counterfactuals and Causal Inference is an excellent introduction to causal inference generally, and Pearl's approach to it specifically. I think that it is a pretty easy to understand that, given the necessary conditions for causal identification and the apparent complexity of social systems, causal inference in many cases (not all) is not really a possibility. I'm looking at you comparative politics and international relations.