edarf is a package I wrote for doing exploratory data analysis using random forests. One of its core purposes is to compute the partial dependence of a fitted random forest's predictions for a target/outcome variable on the input variables/covariates/features. This is computed by marginalizing the prediction function over the other inputs (which assumes that the data came from a uniform product distribution).
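To make the marginalization concrete, here is a minimal brute-force sketch in base R. It is not how `edarf` or `mmpf` implement it: `partial_dependence_sketch` is a hypothetical helper written only for this illustration, the simulated data are made up, and a linear model stands in for the random forest so that no extra packages are needed.

```r
# Brute-force partial dependence for a single feature (illustrative only).
# `partial_dependence_sketch` is a hypothetical helper, not part of edarf or mmpf.
partial_dependence_sketch = function(predict_fun, data, feature, grid) {
  sapply(grid, function(value) {
    modified = data
    modified[[feature]] = value   # fix the feature at one grid value in every row
    mean(predict_fun(modified))   # average predictions over the other features
  })
}

set.seed(1)
n = 200
data = data.frame(X1 = rnorm(n), X2 = rnorm(n))
data$y = 2 * data$X1 - data$X2 + rnorm(n)

# assumption: a linear model stands in for the fitted random forest
fit = lm(y ~ X1 + X2, data)
grid = seq(min(data$X1), max(data$X1), length.out = 10)
pd = partial_dependence_sketch(function(d) predict(fit, newdata = d), data, "X1", grid)
```

Each element of `pd` is the average prediction with `X1` held at one grid value, so the cost grows with the number of grid points times the number of rows averaged over.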

The original version of the package (the CRAN version as I write this) was implemented in a relatively naive manner. The newer version (the GitHub version as I write this) is substantially faster.

Performance improvements make the exact same computation a little more than 50% faster, as you can see below.

    library(devtools)
    ## older version as of JOSS publication
    install_github("zmjones/edarf", subdir = "pkg", ref = "joss", force = TRUE)
    library(edarf)
    library(randomForest)
    library(microbenchmark)

    n = 1000
    m = 100
    p = 10
    X = replicate(p, rnorm(n))
    alpha = runif(p)
    data = data.frame(X, y = X %*% alpha + rnorm(n))
    fit = randomForest(y ~ ., data)
    old = microbenchmark(partial_dependence(fit, data, "X1"), times = m)

    ## latest github version
    install_github("zmjones/edarf", subdir = "pkg")
    new = microbenchmark(partial_dependence(fit, "X1", c(10, n), data = data), times = m)
    mean(old$time) / mean(new$time)

    ## [1] 1.563618

The new version of edarf relies on another new package I wrote, mmpf (**M**onte-Carlo **M**ethods for **P**rediction **F**unctions). `mmpf` is much more flexible than `edarf` but somewhat less user-friendly. `edarf` now calls `mmpf` for all computation (it automatically sets some arguments required by `mmpf`).

As a result of this dependency, `edarf` can now subsample the points of integration from the data, which can result in additional speedups. This is controlled by the second element of the argument `n`, which controls the resolution of the grid that is used for marginalization. Here, using 250 randomly sampled points instead of all 1000 decreases computation time by a factor of 5.3 relative to the old version. Compared to the new version using all of the data, the improvement is by a factor of 3.4. Note that this does necessarily make the estimate more variable.
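The extra variability from subsampling can be seen in a small base R sketch. This is only an illustration under stated assumptions: `pd_at` is a hypothetical helper, the data are simulated, and a linear model again stands in for the random forest. It is not the code path that `edarf` or `mmpf` use.

```r
set.seed(1)
n = 1000
data = data.frame(X1 = rnorm(n), X2 = rnorm(n))
data$y = data$X1 + data$X2^2 + rnorm(n)

# assumption: a linear model stands in for the fitted random forest
fit = lm(y ~ X1 + I(X2^2), data)

# average prediction with X1 fixed at `value`, using only the rows in `rows`
# (`pd_at` is a hypothetical helper for this sketch)
pd_at = function(value, rows) {
  modified = data[rows, ]
  modified$X1 = value
  mean(predict(fit, newdata = modified))
}

full = pd_at(0, seq_len(n))                        # all n integration points
sub = replicate(50, pd_at(0, sample(n, n * .25)))  # 250 points, redrawn 50 times

sd(sub)                # spread introduced by subsampling
abs(mean(sub) - full)  # distance between the average subsampled estimate and the full one
```

The full-data estimate is a single fixed number, while each redraw of 250 rows gives a slightly different one. That spread is the price paid for the shorter run time.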

    new_subsample = microbenchmark(partial_dependence(fit, "X1", c(10, n * .25), data = data), times = m)
    mean(old$time) / mean(new_subsample$time)

    ## [1] 5.334388

    mean(new$time) / mean(new_subsample$time)

    ## [1] 3.411568

Please open an issue or email me if you have any problems using the newer version!

If you use `edarf` in your research, please cite it!

    @article{jones2016,
      doi = {10.21105/joss.00092},
      url = {http://dx.doi.org/10.21105/joss.00092},
      year = {2016},
      month = {oct},
      publisher = {The Open Journal},
      volume = {1},
      number = {6},
      author = {Zachary M. Jones and Fridolin J. Linder},
      title = {edarf: Exploratory Data Analysis using Random Forests},
      journal = {The Journal of Open Source Software}
    }