In R, I want to summarize my data after grouping it by the runs of a variable x (that is, each group is a stretch of consecutive rows that share the same x value). For instance, consider the following data frame, where I want to compute the average y value within each run of x:
(dat <- data.frame(x=c(1, 1, 1, 2, 2, 1, 2), y=1:7))
#   x y
# 1 1 1
# 2 1 2
# 3 1 3
# 4 2 4
# 5 2 5
# 6 1 6
# 7 2 7
In this example, the x variable has runs of length 3, then 2, then 1, and finally 1, taking values 1, 2, 1, and 2 in those four runs. The corresponding means of y in those groups are 2, 4.5, 6, and 7.
It is easy to carry out this grouped operation in base R using tapply, passing dat$y as the data, using rle to compute the run number from dat$x, and passing the desired summary function:
tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
#   1   2   3   4 
# 2.0 4.5 6.0 7.0 
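To make the one-liner above easier to follow, here is a small base R sketch that breaks it into steps. The intermediate names r and run_id are only for illustration and are not part of the original code.

```r
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# rle() returns the length of each run and the value it takes
r <- rle(dat$x)
r$lengths  # 3 2 1 1
r$values   # 1 2 1 2

# Repeat each run's index by that run's length to get one run id per row
run_id <- rep(seq_along(r$lengths), r$lengths)
run_id     # 1 1 1 2 2 3 4

# Group y by the run id and take the mean of each group
run_means <- tapply(dat$y, run_id, mean)
run_means
```

The with(rle(dat$x), ...) form does the same thing without naming the intermediate rle object.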
I figured I would be able to pretty directly carry over this logic to dplyr, but my attempts so far have all ended in errors:
library(dplyr)
# First attempt
dat %>%
  group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'
# Attempt 2 -- maybe "with" is the problem?
dat %>%
  group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
  summarize(mean(y))
# Error: invalid subscript type 'closure'
For completeness, I could reimplement the rle run id myself using cumsum, head, and tail to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:
dat %>%
  group_by(run=cumsum(c(1, head(x, -1) != tail(x, -1)))) %>%
  summarize(mean(y))
#     run mean(y)
#   (dbl)   (dbl)
# 1     1     2.0
# 2     2     4.5
# 3     3     6.0
# 4     4     7.0
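For readers who find the cumsum expression hard to parse, this is a step-by-step sketch of what it computes on the example data. The names changed and run are illustrative only.

```r
x <- c(1, 1, 1, 2, 2, 1, 2)

# Compare every element with the one before it:
# head(x, -1) drops the last element, tail(x, -1) drops the first
changed <- head(x, -1) != tail(x, -1)
changed  # FALSE FALSE TRUE FALSE TRUE TRUE

# The first row always starts run 1; each TRUE after that starts a new run
run <- cumsum(c(1, changed))
run      # 1 1 1 2 2 3 4
```

Because the vector starts with the double 1, the result is a double vector, which is why the run column prints as (dbl) above.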
What is causing my rle-based grouping code to fail in dplyr, and is there any solution that enables me to keep using rle when grouping by run id?
One option seems to be the use of {} as in:
dat %>%
    group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
    summarize(mean(y))
#Source: local data frame [4 x 2]
#
#     yy mean(y)
#  (int)   (dbl)
#1     1     2.0
#2     2     4.5
#3     3     6.0
#4     4     7.0
It would be nice if future dplyr versions also had an equivalent of data.table's rleid function.
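In the meantime, data.table's rleid can be called directly inside group_by, so the run id does not need to be built by hand. The sketch below assumes both dplyr and data.table are installed; it calls rleid through the data.table:: prefix so that data.table does not have to be attached.

```r
library(dplyr)

dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# rleid() returns an integer run id that increases whenever x changes
run <- data.table::rleid(dat$x)
run  # 1 1 1 2 2 3 4

# Use it as the grouping expression
res <- dat %>%
  group_by(run = data.table::rleid(x)) %>%
  summarize(mean_y = mean(y))
res
```

Newer dplyr releases (1.1.0 and later) also provide consecutive_id(), which is modeled on rleid and could be swapped in here if your installed version has it.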
I noticed that this problem occurs with a data.frame or tbl_df input, but not with a tbl_dt or data.table input:
dat %>%
    tbl_df %>%
    group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
    summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'
dat %>%
    tbl_dt %>%
    group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
    summarize(mean(y))
# Source: local data table [4 x 2]
#      yy mean(y)
#   (int)   (dbl)
# 1     1     2.0
# 2     2     4.5
# 3     3     6.0
# 4     4     7.0
I reported this as an issue on dplyr's GitHub page.
If you explicitly create a grouping variable g, it more or less works:
dat %>%
    transform(g = with(rle(dat$x), rep(seq_along(lengths), lengths))) %>%
    group_by(g) %>%
    summarize(mean(y))
# Source: local data frame [4 x 2]
#       g mean(y)
#   (int)   (dbl)
# 1     1     2.0
# 2     2     4.5
# 3     3     6.0
# 4     4     7.0
I used transform here because mutate throws an error.
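A variation on the same idea is to wrap the rle logic in a small helper and compute the grouping column before any dplyr verb is involved. The helper name run_id is hypothetical (it is not a base R or dplyr function), and the sketch uses only base R so it runs without dplyr; I have not checked how the helper behaves inside group_by on the dplyr version that produced the errors above.

```r
# Hypothetical helper: one integer run id per element of v
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), r$lengths)
}

dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# Build the grouping column up front, outside of any dplyr expression
dat$g <- run_id(dat$x)

# Base R check of the per-run means
res <- aggregate(y ~ g, data = dat, FUN = mean)
res
#   g   y
# 1 1 2.0
# 2 2 4.5
# 3 3 6.0
# 4 4 7.0
```

With g already stored as an ordinary column, dat %>% group_by(g) %>% summarize(mean(y)) no longer has to evaluate rle inside the grouping expression.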