What is the fastest way to use `dplyr` to find the row-wise mean and variance of a large tibble?

I am looking to generate a mean and variance value for every row in a numeric tibble. My existing code, which I thought was an idiomatic dplyr solution, takes several hours to complete for 50,000 rows of about 35 columns.

Is there a way to speed up this operation using only dplyr? I know apply and purrr are options, but I am mostly curious if there is something about dplyr I'm overlooking when performing a large series of calculations like this.

Reproducible example:

library(tidyverse)
library(vroom)
gen_tbl(50000, cols = 40, 
        col_types = paste0(rep("d", 40), collapse = "")) %>%
  rowwise() %>%
  mutate(mean = mean(c_across(1:40)),
         var = var(c_across(1:40)))  # select the original columns so `var` doesn't include the new `mean`

My suspicion lies with rowwise(), but I am interested in whether there is a more nuanced way to solve this with dplyr, or if it is simply not a problem dplyr is good at.

Taylor F asked Dec 05 '25

1 Answer

The processing time for the rowwise approach appears to grow quadratically with the number of rows:

[benchmark plot: rowwise processing time vs. number of rows]

Pivoting longer makes the calculation about 300x faster. For 50k rows, the code below took 1.2 seconds, compared to 372 seconds for the rowwise approach.

df %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_longer(-row) %>%
  group_by(row) %>%
  summarize(mean = mean(value),
            var = var(value)) %>%
  bind_cols(df, .)
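As a minimal sketch of why this works, the pivot-longer approach on a tiny illustrative tibble (column names `a`, `b`, `c` are made up for the example) produces the same means and variances as the rowwise approach:

```r
library(dplyr)
library(tidyr)

df <- tibble(a = c(1, 4), b = c(2, 5), c = c(3, 9))

# Reshape to long form, then compute per-row stats as grouped summaries.
long_res <- df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row) %>%
  group_by(row) %>%
  summarize(mean = mean(value), var = var(value)) %>%
  select(-row) %>%
  bind_cols(df, .)

# Same result via rowwise(), restricted to the original columns.
row_res <- df %>%
  rowwise() %>%
  mutate(mean = mean(c_across(a:c)),
         var = var(c_across(a:c))) %>%
  ungroup()

# Row 1: mean of (1, 2, 3) is 2, var is 1
# Row 2: mean of (4, 5, 9) is 6, var is 7
all.equal(long_res$mean, row_res$mean)
all.equal(long_res$var, row_res$var)
```

The key design point is that `group_by(row) %>% summarize()` runs one vectorized computation per group instead of re-evaluating the expressions for every row, which is where `rowwise()` spends its time.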
Jon Spring answered Dec 07 '25