Data Mining as Exploratory Data Analysis

Thanks to Fridolin Linder and Chris Fariss for comments.

Cyrus Samii has a post that is germane to this topic as well. The NBER's 2013 Summer School on Econometric Methods for High-Dimensional Data has some relevant lectures on data mining as well: part 1, 2, 3, and slides. Hal Varian, Leo Breiman, and Richard Berk all have papers on this topic.


In political science, observational data is common. Making causal inferences from observational data is difficult because, absent credibly exogenous variation (w.r.t. the variable we'd like to know the true effect of), we can never be sure of unconfoundedness. Unconfoundedness is unverifiable, and it seems unlikely to hold in most situations. Social systems are often dizzyingly complex and confounding can occur regardless of whether theory says it should.1 It is possible to use partial identification to probe how sensitive estimates are to confounding, but, at least in political science, this is rare (though, see Keele and Minozzi 2013).

Two additional complications arise. First, as we all know, social reality is malleable. Performativity occurs in experimental and observational settings. There is variation in the magnitude, direction, and variance of causal effects in social systems across time and place.2 Reification still occurs, though it is perhaps less brash than it once was. Second, many observational studies, despite using causal language, don't have a clean mapping to an idealized experiment. Instead the aim is to assess which variables are most important (implicitly, in predicting the outcome). Often there are a number of measures of interest, and the outcome variable is regressed on them along with a number of "control" variables. In these settings it is apparent that the purpose of statistical control is misunderstood.3 Control is for adjusting treatment estimates that are biased due to confounding.4

What are we to do when we want to assess the importance of a number of measures with only observational data available? The typical approach is to specify a regression model in which the measures enter the equation in a particular form chosen by the analyst. The model is fit to the entire set of available data, which may or may not be collected via a probabilistic mechanism.5 The importance of each predictor is assessed by null hypothesis significance testing (NHST), which is rather inappropriate in many situations. Fitting the model to the entire set of available data makes over-fitting easy. The way regression models are specified suggests that the analyst has a strong idea about what the functional form of the relationship between each explanatory variable and the response looks like, as well as the form of the relationship between arbitrary groupings of the explanatory variables and the response. It is rarely the case that theory is both specific and reliable enough to justify this.
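For concreteness, here's roughly what that recipe looks like in code. This is a minimal sketch: the data file, the outcome, and the predictor names are all hypothetical, and the point is just the workflow (one analyst-specified linear model, fit once to all the data, with p-values doing the work).

```python
# A sketch of the standard recipe: a linear, additive specification chosen by
# the analyst, fit to all available data, with importance read off p-values.
# The file and variable names below are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("observational_data.csv")  # hypothetical data set

# The analyst fixes the functional form up front: linear and additive.
fit = smf.ols("repression ~ gdp_pc + population + democracy + civil_war",
              data=df).fit()

# Each predictor's "importance" is judged by its p-value (NHST),
# with no held-out data to guard against over-fitting.
print(fit.summary())
```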

Data mining is, in social science, a pejorative term that refers to estimating a number of models in search of "interesting" or "counter-intuitive" results, where "results" refers to regression coefficients that are statistically significant: $p$-hacking. Outside of social science, data mining does not have this meaning. There, data mining is synonymous with machine/statistical learning and pattern recognition, which are part of artificial intelligence. These methods are designed for automated knowledge discovery: discovering relations between features of data (variables) without the analyst specifying the function mapping the features to the response (supervised learning) or, with unlabeled data, where there is no "response" variable, the relations between the features (unsupervised learning). Data mining in the social science sense is bad because it is done with methods that have poor statistical properties in that role and because we sometimes pretend that we can make causal inferences based on these models. I should note that theory is of course driving what questions we ask of the data, what measures we have constructed, and how we interpret the results. However, we often seem to act like theory is more specific and reliable than it probably is.
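To make the distinction between the two kinds of learning concrete, here is a toy sketch with simulated data (the models and numbers mean nothing in themselves):

```python
# Toy contrast between the two senses of "learning" described above.
# Supervised: learn the mapping from features to a response without writing
# the functional form down. Unsupervised: no response variable at all, just
# structure among the features. The data here are simulated for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))                                       # features
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.1, 500)   # response

# Supervised learning: approximate the mapping from X to y from labeled examples.
supervised = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)

# Unsupervised learning: look for groupings among the features alone.
unsupervised = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

print(supervised.score(X, y))      # in-sample fit of the learned mapping
print(unsupervised.labels_[:10])   # cluster assignments for the first ten rows
```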

Using statistical learning is preferable to the NHST-based recipe described above. There are machine learning algorithms (models) that can be used for both regression and classification. They can help us answer questions like "which variables best predict instances of state repression" while presupposing less about the structure of the data. The analyst doesn't have to assume linearity and additivity or explicitly model deviations from them: these features can be learned without analyst input.6 This is data mining with (potentially) good statistical properties. If we do data mining with methods designed for this purpose and treat the results as a form of exploratory data analysis (EDA), I think data mining could cease being a pejorative term and start being helpful.7 Confounding, post-treatment bias, selection bias, misspecification, causal effect heterogeneity, and measurement error don't go away when we are data mining, and estimates from these models are not causal.8 What they can do is point us in the right direction. It is up to us to decide, on the basis of theory, whether a variable's importance is due to one of the aforementioned forms of bias.
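Here is a sketch of what this looks like in practice, again with a hypothetical data set and column names: a flexible learner evaluated out of sample, next to a linear, additive baseline.

```python
# A sketch of data mining as EDA: compare a flexible learner to a linear,
# additive baseline using out-of-sample predictive performance rather than
# in-sample p-values. The data file and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("observational_data.csv")  # hypothetical data set
X = df.drop(columns="repression")
y = df["repression"]

# The forest can pick up non-linearities and interactions from the data;
# the analyst does not have to specify them in advance.
models = {
    "linear, additive": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=500, random_state=1),
}

for name, est in models.items():
    scores = cross_val_score(est, X, y, cv=10, scoring="r2")
    print(f"{name}: mean cross-validated R^2 = {scores.mean():.3f}")
```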

A recent paper I wrote with Danny Hill attempts to answer the question posed above in this way. What we did, briefly, is take a large set of variables which have been used to operationalize various explanations for state repression, a number of measures of state repression, and a set of algorithms designed for approximating (learning) the function mapping the feature space (input variables) to the response. These explanatory variables are the result of many years of work by dozens of scholars, as are the measures of state repression we used (for which we are very grateful). Though we considered a large set of explanatory variables (30+), the fact that we have these measures and not others is itself the result of theory. We then assessed which explanations do the best at predicting our various measures of state repression without presuming that the input variables do or do not interact or what the relationship between them and the response looks like. We found that many of the most common explanations for state repression weren't very powerful, and a few under-studied measures were quite powerful. There were a number of interesting patterns in our findings that we think might be helpful to scholars in this area. I've written a summary of this paper that I'll post when/if it gets published.
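The paper itself spells out exactly what we did; as a rough illustration of the general idea (not our actual procedure), predictive importance can be assessed with something like permutation importance on held-out data, with no assumed functional form. The file and column names below are, again, hypothetical.

```python
# An illustration of the general approach, not the procedure in the paper:
# fit a flexible learner, then ask how much held-out predictive accuracy
# degrades when each variable is shuffled. Variables whose permutation hurts
# prediction most are "important" in a predictive (not causal) sense.
# The data file and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("repression_data.csv")  # hypothetical data set
X = df.drop(columns="repression")
y = df["repression"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
forest = RandomForestRegressor(n_estimators=500, random_state=1)
forest.fit(X_train, y_train)

result = permutation_importance(forest, X_test, y_test,
                                n_repeats=30, random_state=1)
ranking = sorted(zip(X.columns, result.importances_mean),
                 key=lambda pair: -pair[1])
for name, drop in ranking[:10]:
    print(f"{name}: {drop:.3f}")
```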

When we have observational data, no (as-if) randomization, and lack a clear mapping to an idealized experiment, I think we should default to flexible, inductive methods. I think these three conditions are true in a very large number of studies, particularly in international relations and comparative politics. Our use of restrictive regression models is inappropriate, especially when we ignore model validation. Regularization and validation are topics for another day though.

Send me an email or let me know on Twitter if you have any comments on the topic.


  1. Many times only the sign and statistical significance of a regression coefficient are discussed: giving a qualitative, discrete interpretation to a continuous quantity. This suggests that many analysts don't believe the effects they have estimated. 

  2. This is one reason why the failure of many studies to replicate shouldn't be surprising. The inability to even reproduce analyses is more troubling imo. Some causal effects are probably less variable than others. 

  3. It is frequently the case that a variable's previous association with the outcome variable is used as justification for its inclusion as a control. In many areas there are also "standard" control variables. 

  4. Also, you have to get the functional form for each control (and group of controls) correct to get an unbiased treatment effect. 

  5. Without sampling or random assignment to treatment, frequentist inference doesn't make any sense. Sampling from an imaginary "super-population" is a weak justification. Though, see Cyrus Samii on this as well. 

  6. As Jeff Arnold pointed out, there is this weird idea in social science that you are making fewer assumptions when you assume linearity and that the effect of most variables is 0. 

  7. Outside of social science data mining is not at all pejorative. 

  8. Estimates from regression models aren't either, and my argument is that they are usually much worse. Additionally, they are often presented in a way that encourages causal interpretation.