Multiply the values of columns by a value, which is found in a different dataset

Question

In R, to create a set of new columns, I need to multiply the values of some columns by a value, which is found in a different dataset. My data looks like this:

data1 <- data.frame(
  id  = seq(1:5),
  d01 = c(1.5, 4, 3, 2, 1),
  d02 = c(1, 2, 1, 4.5, 3),
  d03 = c(2, 4, 3, 2, 5)
)

data2 <- data.frame(
  id = c('d01', 'd02', 'd03'),
  w  = c(2, 4, 1.5)
)

I want to multiply the values in data1$d01 by 2 because in the row in data2 where id == d01, w == 2. So my result should look like this:

result <- data.frame(
  id  = seq(1:5),
  d01 = c(1.5, 4, 3, 2, 1),
  wd01 = c(3, 8, 6, 4, 2),
  d02 = c(1, 2, 1, 4.5, 3),
  wd02 = c(4, 8, 4, 18, 12),
  d03 = c(2, 4, 3, 2, 5),
  wd03 = c(3, 6, 4.5, 3, 7.5)
)

In my real data, data1 has 300.000 rows and this is happening inside a function which I am calling 10.000 times every time creating a different data2 and data1. Therefore, computational speed is key, and the loop which I can write myself to solve this problem is too slow. The number of rows in data2 (number of columns in data1 + id and the other variables in my real data) varies between each call to the function but is always between 2 and 35, while the number of columns in data2 is always 4 (remaing columns not showed here for simplicity). The columns data1 which I need to multiply always starts with d0 and are the only columns with this naming pattern (same for the values of cells in data2).

Inside the function I also need to multiply the columns in data1 with values from a different column in data2, so I would prefer to add columns to data1 with the multiplied values, instead of modifying the original columns.

Gregor Thomas · Accepted Answer

Here's a simple for loop approach. I would be surprised if there are much faster solutions without reshaping your data to try to turn this into, say, matrix multiplication, but the reshaping would probably be more expensive than the efficiency gain for this single operation.

One way to speed things up might be to use data.tables instead of base data frames. Or perhaps just use matrices--you show no non-numeric data in your example; if a matrix would work for data1 and data2 were a vector in the right column order, that would probably be a good bit faster.

for(i in 1:nrow(data2)) {
  data1[paste0("w", data2$id[i])] = data1[[data2$id[i]]] * data2$w[i]
}

data1
#   id d01 d02 d03 wd01 wd02 wd03
# 1  1 1.5 1.0   2    3    4  3.0
# 2  2 4.0 2.0   4    8    8  6.0
# 3  3 3.0 1.0   3    6    4  4.5
# 4  4 2.0 4.5   2    4   18  3.0
# 5  5 1.0 3.0   5    2   12  7.5

Turns out my instincts were wrong, and the for loop method is much faster than a matrix multiplication method, however the lapply method is even faster by a good bit. Here's a benchmark on "full size" data (results first, benchmarking code follows).

bench::mark(
  matrix_mult = matrix_mult(m1, w),
  for_loop = for_loop(data1, data2),
  lapply = lapply_replace(data1, data2),
  pivot = pivot(data1, data2),
  check = FALSE
)
  
#   expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory               time            gc      
#   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>               <list>          <list>  
# 1 matrix_mult 226.35ms 264.33ms     3.78    323.9MB     3.78     2     2   528.65ms <NULL> <Rprofmem [90 × 3]>  <bench_tm [2]>  <tibble>
# 2 for_loop     47.28ms     52ms    17.4     120.3MB     1.93     9     1   516.81ms <NULL> <Rprofmem [440 × 3]> <bench_tm [9]>  <tibble>
# 3 lapply         8.4ms   9.34ms    84.9      81.3MB     5.93    43     3   506.21ms <NULL> <Rprofmem [244 × 3]> <bench_tm [43]> <tibble>
# 4 pivot          1.99s    1.99s     0.503     2.5GB     2.01     1     4      1.99s <NULL> <Rprofmem [737 × 3]> <bench_tm [1]>  <tibble>
# Warning message:
# Some expressions had a GC in every iteration; so filtering is disabled.

Code for benchmarking:

set.seed(47)
nr = 3e5
nc = 35
data1 = data.frame(id = 1:nr, replicate(nc, runif(nr)))
names(data1)[-1] = sprintf("d%02d", 1:nc)
data2 = data.frame(id = sprintf("d%02d", 1:nc), w = runif(nc))

m1 = as.matrix(data1)
w = data2$w

matrix_mult = function(m1, w) {
  res = m1[, -1] %*% diag(w)
  colnames(res) = paste0("w", colnames(m1)[-1])
  cbind(m1, res)
}

for_loop = function(d1, d2) {
  for(i in 1:nrow(d2)) {
      d1[paste0("w", d2$id[i])] = d1[[d2$id[i]]] * d2$w[i]
  }
  d1
}

lapply_replace = function(d1, d2) {
   d1[paste0("w",names(d1)[-1])] <- lapply(names(d1)[-1], \(x) {
      d1[,match(x, d2$id)+1] * d2[match(x, d2$id), "w"]
    })
   d1
}

pivot = function(d1, d2) {
  d1 %>%
      # Unpivot to 'id' | 'name' | 'value'.
      pivot_longer(!id) %>%
      # Match each 'w' to its appropriate column 'name'.
      left_join(d2,
        join_by(name == id)
      ) %>%
      mutate(
        # Multiply by 'w'...
        prod = value * w,
        # ...and also prefix the column 'name' with "w".
        name = paste0("w", name)
      ) %>%
      # Pivot to 'id' | 'wd01' | ... | 'wd03'. 
      pivot_wider(
        id_cols = id,
        values_from = prod,
        names_from = name
      ) %>%
      # Use 'id' to associate each 'wd*' with its 'd*'.
      right_join(d1, "id")
}

jblood94 · Answer

collapse::setop is fast for this application. Timing using @GregorThomas' benchmark:

library(collapse) # for `setop`

collapse_setop <- function(d1, d2) {
  cbind(
    d1,
    setNames(setop(d1[,d2$id], "*", d2$w, rowwise = TRUE), paste0("w", d2$id))
  )
}
bench::mark(
  lapply = lapply_replace(data1, data2),
  collapse_setop = collapse_setop(data1, data2)
)
#> # A tibble: 2 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lapply          17.86ms  19.11ms      49.7    81.3MB     137.
#> 2 collapse_setop   5.78ms   6.17ms     158.     28.7KB       0

NB: beware that data1 is modified by reference when using collapse_setop, but if the operation is data1 <- lapply_replace(data1, data2), it won't matter because all the data in data1 is preserved in the output.

As a side note, for a rowwise minimum, using do.call(pmin, data1[,-1]) will probably be faster than as.data.frame(RowMins(as.matrix(data1[,-1))).

Multiply the values of columns by a value, which is found in a different dataset

Tags:

r

EmilA

2 Answers

Gregor Thomas

jblood94

Recent Activity

Donate For Us

Multiply the values of columns by a value, which is found in a different dataset

Tags:

r

EmilA

2 Answers

Gregor Thomas

jblood94

Related questions

Recent Activity

Donate For Us