
dplyr: summarize by group (sum) is extremely slow

Tags: r, dplyr

I have two datasets: inds (individuals) and hhs (households).

I'm trying to sum all inds$income grouped by UID (unique ID) and create a new column in hhs with the total income of all individuals in each household. Some individuals have no income, in which case that variable is NA. The code I'm using is:

hhs <- left_join(hhs, inds %>% group_by(UID) %>% summarize(hhincome = sum(income, na.rm=TRUE)))

However, it's extremely slow. inds has over 2 million rows, hhs has about 550k rows. I've used dplyr to average or count similar variables in the same datasets, and it usually takes ten seconds or less to get this done. Is there anything I'm doing wrong? Is there any way to make it faster?

Edit: as I said, I've used dplyr to get mean household values without a problem with

hhs <- left_join(hhs, inds %>% filter(AGE > 2) %>% group_by(UID) %>% summarize(L_Bilingual = mean(Bilingual, na.rm=TRUE)))

It takes 5 seconds to calculate the means with the code above. Is there anything about sum() in particular that makes it slower?

asked by Werner

2 Answers

With the fake data below, the summarize-and-join takes about 2 seconds on my machine, a new MacBook Pro. Even on a slower machine it shouldn't take more than 10 or 15 seconds. Is it possible that the join is creating many more rows than you think? (See the diagnostic sketch after the benchmark below.) If you provide more details about the structure of your data, we can be more specific about what might be going wrong.

library(tidyverse)
library(microbenchmark)

# Generate two data frames with only the UID column in common:
# 550k households with 30 filler columns, and 2 million individuals
# assigned to those households, each with an income column.
set.seed(2)
hhs = data.frame(UID=1:550000, replicate(30, runif(550000))) %>% 
  set_names(c("UID", paste0("V",1:30)))
inds = data.frame(UID=sample(1:550000, 2e6, replace=TRUE), 
                  income=rnorm(2e6, 5e4, 1e4),
                  replicate(20, rnorm(2e6, 5e4, 1e4)))

microbenchmark(join=left_join(hhs, inds %>% group_by(UID) %>% 
                                summarize(hhincome = sum(income, na.rm=TRUE))),
               times=5)
Unit: seconds
 expr      min       lq     mean median       uq      max neval
 join 1.924749 1.988773 2.722018 2.0063 2.068044 5.622223     5
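
If you suspect the join itself is blowing up the row count, here is a quick diagnostic (a sketch, assuming UID is the join key in both tables):

library(dplyr)

# summarize() collapses inds to one row per UID, so the join can only
# grow hhs if hhs itself has duplicate UIDs.
hhs %>% count(UID) %>% filter(n > 1)   # should return zero rows

# Compare row counts before and after the join:
nrow(hhs)
nrow(left_join(hhs,
               inds %>% group_by(UID) %>%
                 summarize(hhincome = sum(income, na.rm = TRUE)),
               by = "UID"))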
answered by eipi10


Here's what was making it slow: I hadn't realized that inds$income had labels:

> head(inds$income)
<Labelled double>: Earned income
[1]      0      0      0      0 258000      0

Labels:
value                  label
 99999998       Unknown/missing.
 99999999 NIU (not in universe).

The problem was fixed when I removed the labels by recoding the column with as.numeric().
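
For reference, a minimal sketch of the fix, assuming the labels come from the labelled class shown above. Note that the label values 99999998 and 99999999 look like missing-value codes, which you may want to turn into NA before summing; treating them as NA is an assumption based on their labels, not something stated in the question.

library(dplyr)

# Strip the labelled class so sum() operates on a plain numeric vector.
# haven::zap_labels() would also work if the labels come from haven.
inds$income <- as.numeric(inds$income)

# Assumption: recode the coded values to NA, since their labels read
# "Unknown/missing." and "NIU (not in universe)."
inds$income[inds$income %in% c(99999998, 99999999)] <- NA

hhs <- left_join(hhs,
                 inds %>% group_by(UID) %>%
                   summarize(hhincome = sum(income, na.rm = TRUE)),
                 by = "UID")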

answered by Werner (the asker)


