I'm interested in using dplyr to construct bootstrap replications (repeated analyses in which the data are first sampled with replacement each time). Hadley Wickham has provided some code for repeating bootstrapped analyses efficiently:
bootstrap <- function(df, m) {
  n <- nrow(df)
  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE),
                                   simplify = FALSE)
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "biggest_group_size") <- n
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(quote(boot)) # list(substitute(bootstrap(m)))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")
  df
}
library(dplyr)
mboot <- bootstrap(mtcars, 10)
# Works
mboot %>% summarise(mean(cyl))
While this function works well with summarise, it fails with do when the expression passed to do returns a data.frame. (Imagine for now that the data.frame contains something useful, such as the results of the analysis we wish to bootstrap.)
bootstrap(mtcars, 3) %>% do(data.frame(x=1:2))
# Error: index out of bounds
with the traceback
11: stop(list(message = "index out of bounds", call = NULL, cppstack = NULL))
10: .Call("dplyr_grouped_df_impl", PACKAGE = "dplyr", data, symbols,
drop)
9: grouped_df_impl(data, unname(vars), drop)
8: grouped_df(cbind_list(labels, out), groups)
7: label_output_dataframe(labels, out, groups(.data))
6: do.grouped_df(`bootstrap(mtcars, 3)`, data.frame(x = 1:2))
5: do(`bootstrap(mtcars, 3)`, data.frame(x = 1:2))
4: eval(expr, envir, enclos)
3: eval(e, env)
2: withVisible(eval(e, env))
1: bootstrap(mtcars, 3) %>% do(data.frame(x = 1:2))
I was able to work around this by performing two do steps and a group_by:
bootstrap(mtcars, 10) %>% do(d=data.frame(x=1:2)) %>% group_by(replicate) %>% do(.$d[[1]])
but this requires several extra, somewhat clumsy steps (and also produces the warning "Grouping rowwise data frame strips rowwise nature"). I'm also aware that I could first replicate the data into ten bootstrap samples with something like
data.frame(boot=1:10) %>% group_by(boot) %>% do(sample_n(mtcars, nrow(mtcars), replace=TRUE))
but if the data or the number of bootstrap replicates is large, this is extremely memory-inefficient, because every resampled copy of the data is held in memory at once.
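To make the memory point concrete, here is a minimal base R sketch of the index-only idea. It is not dplyr's mechanism; `boot_stat` and the `mean_cyl` summary are hypothetical stand-ins for a real analysis. Only the integer index vectors are stored for all replicates, and each resample exists only while its summary is being computed.

```r
# Draw m index vectors up front; each holds only n integers,
# so the data itself is never copied m times at once.
set.seed(1)  # only to make the sketch reproducible
n <- nrow(mtcars)
m <- 10
idx <- replicate(m, sample(n, replace = TRUE), simplify = FALSE)

# Hypothetical per-replicate analysis: materialise one resample,
# compute a small summary, and let the resample be discarded.
boot_stat <- function(i) {
  d <- mtcars[i, , drop = FALSE]
  data.frame(mean_cyl = mean(d$cyl))
}

# One row of results per replicate, labelled by replicate number
res <- do.call(rbind, lapply(idx, boot_stat))
res$replicate <- seq_len(m)
```

This is what I would like the grouped `bootstrap()` object to let `do` handle for me.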
Is there a way, perhaps by altering the bootstrap setup function, that I can perform these replicates with bootstrap(mtcars, 3) %>% do(data.frame(x = 1:2))?
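I think this is a small bug in the bootstrap function. The vars attribute should match the column name of the data.frame stored in the labels attribute, but the function sets vars to boot while the labels column is named replicate. Judging by the traceback, do binds the labels to the output and then regroups by vars (grouped_df(cbind_list(labels, out), groups)), so it looks for a column named boot that does not exist. If you make this minor change: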
bootstrap <- function(df, m) {
  n <- nrow(df)
  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE),
                                   simplify = FALSE)
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "biggest_group_size") <- n
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(quote(replicate)) # Change: match the labels column name
  # attr(df, "vars") <- list(quote(boot)) # list(substitute(bootstrap(m)))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")
  df
}
Then it works as expected:
bootstrap(mtcars, 3) %>% do(data.frame(x=1:2))
# Source: local data frame [6 x 2]
# Groups: replicate
#   replicate x
# 1         1 1
# 2         1 2
# 3         2 1
# 4         2 2
# 5         3 1
# 6         3 2
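If you build grouped objects by hand like this, a quick sanity check can catch the mismatch before calling do. The following is a minimal base R sketch; `vars_match_labels` is a hypothetical helper (not part of dplyr), and the two objects carry only the two relevant attributes rather than a full grouped_df.

```r
# Hypothetical helper: TRUE when every grouping variable named in "vars"
# is also a column of the "labels" data frame.
vars_match_labels <- function(df) {
  vars <- vapply(attr(df, "vars"), as.character, character(1))
  all(vars %in% names(attr(df, "labels")))
}

# Stand-in for the original function's attributes: vars says boot,
# but the labels column is named replicate.
broken <- mtcars
attr(broken, "labels") <- data.frame(replicate = 1:3)
attr(broken, "vars") <- list(quote(boot))

# Stand-in for the corrected function's attributes
fixed <- broken
attr(fixed, "vars") <- list(quote(replicate))

vars_match_labels(broken)  # FALSE
vars_match_labels(fixed)   # TRUE
```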