In R, to create a set of new columns, I need to multiply the values of some columns by a value, which is found in a different dataset. My data looks like this:
data1 <- data.frame(
id = seq(1:5),
d01 = c(1.5, 4, 3, 2, 1),
d02 = c(1, 2, 1, 4.5, 3),
d03 = c(2, 4, 3, 2, 5)
)
data2 <- data.frame(
id = c('d01', 'd02', 'd03'),
w = c(2, 4, 1.5)
)
I want to multiply the values in data1$d01
by 2 because in the row in data2
where id == d01
, w == 2
. So my result should look like this:
result <- data.frame(
id = seq(1:5),
d01 = c(1.5, 4, 3, 2, 1),
wd01 = c(3, 8, 6, 4, 2),
d02 = c(1, 2, 1, 4.5, 3),
wd02 = c(4, 8, 4, 18, 12),
d03 = c(2, 4, 3, 2, 5),
wd03 = c(3, 6, 4.5, 3, 7.5)
)
In my real data, data1
has 300.000 rows and this is happening inside a function which I am calling 10.000 times every time creating a different data2
and data1
. Therefore, computational speed is key, and the loop which I can write myself to solve this problem is too slow. The number of rows in data2
(number of columns in data1 + id and the other variables in my real data) varies between each call to the function but is always between 2 and 35, while the number of columns in data2
is always 4 (remaing columns not showed here for simplicity). The columns data1
which I need to multiply always starts with d0
and are the only columns with this naming pattern (same for the values of cells in data2
).
Inside the function I also need to multiply the columns in data1 with values from a different column in data2, so I would prefer to add columns to data1 with the multiplied values, instead of modifying the original columns.
Here's a simple for loop approach. I would be surprised if there are much faster solutions without reshaping your data to try to turn this into, say, matrix multiplication, but the reshaping would probably be more expensive than the efficiency gain for this single operation.
One way to speed things up might be to use data.table
s instead of base data frames. Or perhaps just use matrices--you show no non-numeric data in your example; if a matrix
would work for data1
and data2
were a vector in the right column order, that would probably be a good bit faster.
for(i in 1:nrow(data2)) {
data1[paste0("w", data2$id[i])] = data1[[data2$id[i]]] * data2$w[i]
}
data1
# id d01 d02 d03 wd01 wd02 wd03
# 1 1 1.5 1.0 2 3 4 3.0
# 2 2 4.0 2.0 4 8 8 6.0
# 3 3 3.0 1.0 3 6 4 4.5
# 4 4 2.0 4.5 2 4 18 3.0
# 5 5 1.0 3.0 5 2 12 7.5
Turns out my instincts were wrong, and the for loop method is much faster than a matrix multiplication method, however the lapply
method is even faster by a good bit. Here's a benchmark on "full size" data (results first, benchmarking code follows).
bench::mark(
matrix_mult = matrix_mult(m1, w),
for_loop = for_loop(data1, data2),
lapply = lapply_replace(data1, data2),
pivot = pivot(data1, data2),
check = FALSE
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 matrix_mult 226.35ms 264.33ms 3.78 323.9MB 3.78 2 2 528.65ms <NULL> <Rprofmem [90 × 3]> <bench_tm [2]> <tibble>
# 2 for_loop 47.28ms 52ms 17.4 120.3MB 1.93 9 1 516.81ms <NULL> <Rprofmem [440 × 3]> <bench_tm [9]> <tibble>
# 3 lapply 8.4ms 9.34ms 84.9 81.3MB 5.93 43 3 506.21ms <NULL> <Rprofmem [244 × 3]> <bench_tm [43]> <tibble>
# 4 pivot 1.99s 1.99s 0.503 2.5GB 2.01 1 4 1.99s <NULL> <Rprofmem [737 × 3]> <bench_tm [1]> <tibble>
# Warning message:
# Some expressions had a GC in every iteration; so filtering is disabled.
Code for benchmarking:
set.seed(47)
nr = 3e5
nc = 35
data1 = data.frame(id = 1:nr, replicate(nc, runif(nr)))
names(data1)[-1] = sprintf("d%02d", 1:nc)
data2 = data.frame(id = sprintf("d%02d", 1:nc), w = runif(nc))
m1 = as.matrix(data1)
w = data2$w
matrix_mult = function(m1, w) {
res = m1[, -1] %*% diag(w)
colnames(res) = paste0("w", colnames(m1)[-1])
cbind(m1, res)
}
for_loop = function(d1, d2) {
for(i in 1:nrow(d2)) {
d1[paste0("w", d2$id[i])] = d1[[d2$id[i]]] * d2$w[i]
}
d1
}
lapply_replace = function(d1, d2) {
d1[paste0("w",names(d1)[-1])] <- lapply(names(d1)[-1], \(x) {
d1[,match(x, d2$id)+1] * d2[match(x, d2$id), "w"]
})
d1
}
pivot = function(d1, d2) {
d1 %>%
# Unpivot to 'id' | 'name' | 'value'.
pivot_longer(!id) %>%
# Match each 'w' to its appropriate column 'name'.
left_join(d2,
join_by(name == id)
) %>%
mutate(
# Multiply by 'w'...
prod = value * w,
# ...and also prefix the column 'name' with "w".
name = paste0("w", name)
) %>%
# Pivot to 'id' | 'wd01' | ... | 'wd03'.
pivot_wider(
id_cols = id,
values_from = prod,
names_from = name
) %>%
# Use 'id' to associate each 'wd*' with its 'd*'.
right_join(d1, "id")
}
collapse::setop
is fast for this application. Timing using @GregorThomas' benchmark:
library(collapse) # for `setop`
collapse_setop <- function(d1, d2) {
cbind(
d1,
setNames(setop(d1[,d2$id], "*", d2$w, rowwise = TRUE), paste0("w", d2$id))
)
}
bench::mark(
lapply = lapply_replace(data1, data2),
collapse_setop = collapse_setop(data1, data2)
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 lapply 17.86ms 19.11ms 49.7 81.3MB 137.
#> 2 collapse_setop 5.78ms 6.17ms 158. 28.7KB 0
NB: beware that data1
is modified by reference when using collapse_setop
, but if the operation is data1 <- lapply_replace(data1, data2)
, it won't matter because all the data in data1
is preserved in the output.
As a side note, for a rowwise minimum, using do.call(pmin, data1[,-1])
will probably be faster than as.data.frame(RowMins(as.matrix(data1[,-1)))
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With