edarf is a package I wrote for doing exploratory data analysis using random forests. One of its core purposes is to compute the partial dependence of a fitted random forest's predictions for a target/outcome variable on the input variables (covariates/features). This is computed by marginalizing the prediction function, which implicitly assumes that the data came from a uniform product distribution.
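To make the marginalization concrete, here is a minimal sketch (not edarf's actual code): for each value of the target feature on a grid, set that column to the value in every row of the data, predict, and average the predictions. The function and variable names here are hypothetical, and a linear prediction function stands in for a random forest.

```r
## sketch of partial dependence by marginalization (illustrative only)
partial_dependence_sketch = function(predict_fun, data, var, grid) {
  sapply(grid, function(v) {
    data[[var]] = v           ## hold the target feature fixed at v
    mean(predict_fun(data))   ## average predictions over the other features
  })
}

set.seed(1)
df = data.frame(x1 = rnorm(100), x2 = rnorm(100))
pfun = function(d) 2 * d$x1 + d$x2  ## stand-in for a fitted model's predict
pd = partial_dependence_sketch(pfun, df, "x1", grid = c(-1, 0, 1))
```

For a linear prediction function like this one, the partial dependence recovers the linear effect of `x1` (each unit step in the grid moves the averaged prediction by 2).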

The original version of the package (the CRAN version as I write this) was implemented in a relatively naive manner. The newer version (the GitHub version as I write this) is substantially faster.

There have been some performance improvements that make the exact same computation a little more than 50% faster, as you can see below.

```
library(devtools)
## older version as of JOSS publication
install_github("zmjones/edarf", subdir = "pkg", ref = "joss", force = TRUE)
library(edarf)
library(randomForest)
library(microbenchmark)
n = 1000
m = 100
p = 10
X = replicate(p, rnorm(n))
alpha = runif(p)
data = data.frame(X, y = X %*% alpha + rnorm(n))
fit = randomForest(y ~ ., data)
old = microbenchmark(partial_dependence(fit, data, "X1"), times = m)
## latest github version
install_github("zmjones/edarf", subdir = "pkg")
new = microbenchmark(partial_dependence(fit, "X1", c(10, n), data = data), times = m)
mean(old$time) / mean(new$time)
```

```
## [1] 1.563618
```

The new version of edarf relies on another new package I wrote, `mmpf` (**M**onte Carlo **M**ethods for **P**rediction **F**unctions). `mmpf` is much more flexible than `edarf` but somewhat less user-friendly. `edarf` now calls `mmpf` for all computation, automatically setting some of the arguments that `mmpf` requires.
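For reference, `mmpf` can also be called directly. A rough illustration is below; the argument order and names follow my reading of the `mmpf` documentation, so check `?marginalPrediction` for the authoritative signature.

```r
library(mmpf)
library(randomForest)

set.seed(1)
df = data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y = df$x1 + df$x2 + rnorm(200)
fit = randomForest(y ~ ., df)

## marginal prediction for x1 on a 10-point grid,
## integrating over 50 subsampled rows of the data
mp = marginalPrediction(df[, c("x1", "x2")], "x1", c(10, 50), fit)
```

This is essentially what `edarf::partial_dependence` does under the hood, minus the convenience defaults.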

As a result of this dependency, `edarf` can now subsample the points of integration from the data, which can result in additional speedups. This is controlled by the second element of the argument `n`, which sets the resolution of the grid used for marginalization. Here, using 250 randomly sampled points instead of all 1000 decreases total computation time by a factor of 5.3 relative to the old version; compared to the new version using all of the data, the improvement is a factor of 3.4. Note that subsampling does necessarily make the estimate more variable.

```
new_subsample = microbenchmark(partial_dependence(fit, "X1", c(10, n * .25),
  data = data), times = m)
mean(old$time) / mean(new_subsample$time)
```

```
## [1] 5.334388
```

```
mean(new$time) / mean(new_subsample$time)
```

```
## [1] 3.411568
```

Please open an issue or email me if you run into any problems with the newer version!

If you use `edarf` in your research, please cite it!

```
@article{jones2016,
  doi = {10.21105/joss.00092},
  url = {http://dx.doi.org/10.21105/joss.00092},
  year = {2016},
  month = {oct},
  publisher = {The Open Journal},
  volume = {1},
  number = {6},
  author = {Zachary M. Jones and Fridolin J. Linder},
  title = {edarf: Exploratory Data Analysis using Random Forests},
  journal = {The Journal of Open Source Software}
}
```