Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Grouping by consecutive value occurrences

Tags:

r

dplyr

I came across a problem that forced me to use a loop instead of my preferred dplyr pipe flow.

I want to group rows based on consecutive observations of the same value. For example, if the first four observations of type equal a, the first four observations should assigned to the same group. Order matters, so I can't dplyr::group_by and dplyr::summarize.

The code below should explain the problem fairly well. I was wondering if anyone could propose a less verbose way to do this, preferably using tidyverse packages, and not data.tables.

library(tidyverse)

# Crete some test data
df <- tibble(
  id = 1:20,
  type = c(rep("a", 5), rep("b", 5), rep("a", 5), rep("b", 5)),
  val = runif(20)
)

df
#> # A tibble: 20 x 3
#>       id type     val
#>    <int> <chr>  <dbl>
#>  1     1 a     0.0606
#>  2     2 a     0.501 
#>  3     3 a     0.974 
#>  4     4 a     0.0833
#>  5     5 a     0.752 
#>  6     6 b     0.0450
#>  7     7 b     0.367 
#>  8     8 b     0.649 
#>  9     9 b     0.846 
#> 10    10 b     0.896 
#> 11    11 a     0.178 
#> 12    12 a     0.295 
#> 13    13 a     0.206 
#> 14    14 a     0.233 
#> 15    15 a     0.851 
#> 16    16 b     0.179 
#> 17    17 b     0.801 
#> 18    18 b     0.326 
#> 19    19 b     0.269 
#> 20    20 b     0.584

# Solve problem with a loop
count <- 1
df$consec_group <- NA
for (i in 1:nrow(df)) {
  current <- df$type[i]
  lag <- ifelse(i == 1, NA, df$type[i - 1])
  lead <- ifelse(i == nrow(df), NA, df$type[i + 1])

  if (lead %>% is.na) {
    df$consec_group[i] <- ifelse(current == lag, count, count + 1) 
  } else {
    df$consec_group[i] <- count 
    if (current != lead) count <- count + 1
  }
}

df
#> # A tibble: 20 x 4
#>       id type     val consec_group
#>    <int> <chr>  <dbl>        <dbl>
#>  1     1 a     0.0606            1
#>  2     2 a     0.501             1
#>  3     3 a     0.974             1
#>  4     4 a     0.0833            1
#>  5     5 a     0.752             1
#>  6     6 b     0.0450            2
#>  7     7 b     0.367             2
#>  8     8 b     0.649             2
#>  9     9 b     0.846             2
#> 10    10 b     0.896             2
#> 11    11 a     0.178             3
#> 12    12 a     0.295             3
#> 13    13 a     0.206             3
#> 14    14 a     0.233             3
#> 15    15 a     0.851             3
#> 16    16 b     0.179             4
#> 17    17 b     0.801             4
#> 18    18 b     0.326             4
#> 19    19 b     0.269             4
#> 20    20 b     0.584             4

Created on 2019-03-14 by the reprex package (v0.2.1)

This grouping of consecutive type occurrences is really just an intermediate step. My endgame is manipulate val for a given consec_group, based on the values of val that occurred within the previous consec_group. Advice on relevant packages would be appreciated.

like image 955
djfinnoy Avatar asked Jan 19 '26 22:01

djfinnoy


2 Answers

You say "no data.tables", but are you sure? It's so *** fast and easy (in this case)...

library(data.table)
setDT(df)[, groupid := rleid(type)][]

#     id type         val groupid
#  1:  1    a 0.624078793       1
#  2:  2    a 0.687361541       1
#  3:  3    a 0.817702740       1
#  4:  4    a 0.669857208       1
#  5:  5    a 0.100977936       1
#  6:  6    b 0.418275823       2
#  7:  7    b 0.660119857       2
#  8:  8    b 0.876015209       2
#  9:  9    b 0.473562143       2
# 10: 10    b 0.284474633       2
# 11: 11    a 0.034154862       3
# 12: 12    a 0.391760387       3
# 13: 13    a 0.383107868       3
# 14: 14    a 0.729583433       3
# 15: 15    a 0.006288375       3
# 16: 16    b 0.530179235       4
# 17: 17    b 0.802643704       4
# 18: 18    b 0.409618633       4
# 19: 19    b 0.309363642       4
# 20: 20    b 0.021918512       4

If you insist on using the tidyverse/dplyr, you can (of course) still use the rleid-function as follows:

df %>% mutate( groupid = data.table::rleid(type) )

benchmarks

on a larger sample

library(tidyverse)
library(data.table)

# Crete some large test data
df <- tibble(
  id = 1:200000,
  type = sample(letters[1:26], 200000, replace = TRUE),
  val = runif(200000)
)

dt <- as.data.table(df)

microbenchmark::microbenchmark(
  dplyr.rleid      = df %>% mutate( groupid = data.table::rleid(type) ),
  data.table.rleid = dt[, groupid := rleid(type)][], 
  rle = df %>% mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)}),
  rle2 = df %>% mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths))),
  transform = transform(df, ID = with(rle(df$type), rep(seq_along(lengths), lengths))),
  times = 10)

# Unit: milliseconds
#             expr       min        lq      mean    median        uq        max neval
#      dplyr.rleid  3.153626  3.278049  3.410363  3.444949  3.502792   3.582626    10
# data.table.rleid  2.965639  3.065959  3.173992  3.145643  3.259672   3.507009    10
#              rle 13.059774 14.042797 24.364176 26.126176 29.460561  36.874054    10
#             rle2 12.641319 13.553846 30.951152 24.698338 34.139786 102.791719    10
#        transform 12.330717 22.419128 22.725242 25.532084 26.187634  26.702794    10
like image 197
Wimpel Avatar answered Jan 22 '26 11:01

Wimpel


You can use a rleid()-like possibility like this:

df %>%
 mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)})

      id type     val ID_rleid
   <int> <chr>  <dbl>    <int>
 1     1 a     0.0430        1
 2     2 a     0.858         1
 3     3 a     0.504         1
 4     4 a     0.318         1
 5     5 a     0.469         1
 6     6 b     0.144         2
 7     7 b     0.173         2
 8     8 b     0.0706        2
 9     9 b     0.958         2
10    10 b     0.557         2
11    11 a     0.358         3
12    12 a     0.973         3
13    13 a     0.982         3
14    14 a     0.177         3
15    15 a     0.599         3
16    16 b     0.627         4
17    17 b     0.454         4
18    18 b     0.682         4
19    19 b     0.690         4
20    20 b     0.713         4

Or a modification (originally proposed by @d.b) that makes it more handy:

df %>%
 mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths)))
like image 33
tmfmnk Avatar answered Jan 22 '26 10:01

tmfmnk