edarf is a package I wrote for doing exploratory data analysis using random forests. One of its core purposes is to compute the partial dependence of a fitted random forest's predictions for a target/outcome variable on the input variables (covariates/features). This is computed by marginalizing the prediction function, which implicitly assumes that the data came from a uniform product distribution.
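To make the marginalization concrete, here is a minimal sketch (not edarf's actual code): for each value of the target feature on a grid, set that column to the value in every row of the data, predict, and average the predictions. The function and variable names here are hypothetical, and a linear prediction function stands in for a random forest.

```r
## sketch of partial dependence by marginalization (illustrative only)
partial_dependence_sketch = function(predict_fun, data, var, grid) {
  sapply(grid, function(v) {
    data[[var]] = v           ## hold the target feature fixed at v
    mean(predict_fun(data))   ## average predictions over the other features
  })
}

set.seed(1)
df = data.frame(x1 = rnorm(100), x2 = rnorm(100))
pfun = function(d) 2 * d$x1 + d$x2  ## stand-in for a fitted model's predict
pd = partial_dependence_sketch(pfun, df, "x1", grid = c(-1, 0, 1))
```

For a linear prediction function like this one, the partial dependence recovers the linear effect of `x1` (each unit step in the grid moves the averaged prediction by 2).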

The original version of the package (the CRAN version as I write this) was implemented in a relatively naive manner. The newer version (the GitHub version as I write this) is substantially faster.

There have been some performance improvements that make the exact same computation a little more than 50% faster, as you can see below.

```
library(devtools)
## older version as of JOSS publication
install_github("zmjones/edarf", subdir = "pkg", ref = "joss", force = TRUE)
library(edarf)
library(randomForest)
library(microbenchmark)
n = 1000
m = 100
p = 10
X = replicate(p, rnorm(n))
alpha = runif(p)
data = data.frame(X, y = X %*% alpha + rnorm(n))
fit = randomForest(y ~ ., data)
old = microbenchmark(partial_dependence(fit, data, "X1"), times = m)
## latest github version
install_github("zmjones/edarf", subdir = "pkg")
new = microbenchmark(partial_dependence(fit, "X1", c(10, n), data = data), times = m)
mean(old$time) / mean(new$time)
```

```
## [1] 1.563618
```

The new version of edarf relies on another new package I wrote, `mmpf` (**M**onte Carlo **M**ethods for **P**rediction **F**unctions). `mmpf` is much more flexible than `edarf` but somewhat less user-friendly. `edarf` now calls `mmpf` for all computation, automatically setting some of the arguments that `mmpf` requires.
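For reference, `mmpf` can also be called directly. A rough illustration is below; the argument order and names follow my reading of the `mmpf` documentation, so check `?marginalPrediction` for the authoritative signature.

```r
library(mmpf)
library(randomForest)

set.seed(1)
df = data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y = df$x1 + df$x2 + rnorm(200)
fit = randomForest(y ~ ., df)

## marginal prediction for x1 on a 10-point grid,
## integrating over 50 subsampled rows of the data
mp = marginalPrediction(df[, c("x1", "x2")], "x1", c(10, 50), fit)
```

This is essentially what `edarf::partial_dependence` does under the hood, minus the convenience defaults.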

As a result of this dependency, `edarf` can now subsample the points of integration from the data, which can result in additional speedups. This is controlled by the second element of the argument `n`, which sets the resolution of the grid used for marginalization. Here, using 250 randomly sampled points instead of all 1000 decreases total computation time by a factor of 5.3 relative to the old version; compared to the new version using all of the data, the improvement is a factor of 3.4. Note that subsampling does necessarily make the estimate more variable.

```
new_subsample = microbenchmark(partial_dependence(fit, "X1", c(10, n * .25),
  data = data), times = m)
mean(old$time) / mean(new_subsample$time)
```

```
## [1] 5.334388
```

```
mean(new$time) / mean(new_subsample$time)
```

```
## [1] 3.411568
```

Please open an issue or email me if you run into any problems with the newer version!

If you use `edarf` in your research, please cite it!

```
@article{jones2016,
  doi = {10.21105/joss.00092},
  url = {http://dx.doi.org/10.21105/joss.00092},
  year = {2016},
  month = {oct},
  publisher = {The Open Journal},
  volume = {1},
  number = {6},
  author = {Zachary M. Jones and Fridolin J. Linder},
  title = {edarf: Exploratory Data Analysis using Random Forests},
  journal = {The Journal of Open Source Software}
}
```