
Subsetting a dataframe based on summation of rows of a given column

Tags:

r

I am working with a data frame that has three variables (id, time, gender). It looks like this:

df <-
  structure(
    list(
      id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
      time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
      gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
    ),
    .Names = c("id", "time", "gender"),
    class = "data.frame",
    row.names = c(NA,-12L)
  )

That is, each id has four observations for time and gender. For each id, I want to keep the rows, in order, up to and including the first row at which the cumulative sum of time reaches 25 or more. Notice that for id 1 the first three observations are kept (21 + 3 + 4 = 28), for id 2 all observations are kept (the sum only reaches 30 on the fourth row), and for id 3 only the first observation is kept (27 already exceeds 25). The expected result would look like:

df <-
  structure(
    list(
      id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
      time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
      gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
    ),
    .Names = c("id", "time", "gender"),
    class = "data.frame",
    row.names = c(NA,-8L)
  )

Any help on this is highly appreciated.

T Richard asked Feb 04 '26 07:02



1 Answer

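One option is to filter on the lagged cumulative sum within each group. A row is kept as long as the cumulative sum of time over the rows before it is still below 25, so the row that first reaches 25 is included and everything after it is dropped: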

library(dplyr)

df %>% group_by(id,gender) %>%
  filter(lag(cumsum(time), default = 0) < 25 )

# # A tibble: 8 x 3
# # Groups:   id, gender [3]
#      id  time gender
#   <int> <int>  <int>
# 1     1    21      1
# 2     1     3      1
# 3     1     4      1
# 4     2     5      0
# 5     2     9      0
# 6     2    10      0
# 7     2     6      0
# 8     3    27      1

Using data.table: (Updated based on feedback from @Renu)

library(data.table)

setDT(df)

df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]
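
For comparison, the same idea can be sketched in base R without extra packages. This is not part of the original answer; it assumes the rows are already in the desired order within each id and that the threshold is 25, as in the question. The names prev_total and result are illustrative only.

```r
# Rebuild the example data from the question.
df <- data.frame(
  id = rep(1:3, each = 4),
  time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
  gender = rep(c(1L, 0L, 1L), each = 4)
)

# Cumulative sum of `time` over the rows *before* each row, within each id
# (equivalent to lagging the cumulative sum with a default of 0).
prev_total <- ave(df$time, df$id, FUN = function(x) cumsum(x) - x)

# Keep a row while the preceding total is still below the assumed threshold of 25.
result <- df[prev_total < 25, ]
rownames(result) <- NULL
result
```

Grouping here is by id alone, since gender is constant within each id in the example data; if that does not hold for your data, pass both columns to ave() as grouping factors.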
MKR answered Feb 05 '26 23:02
