Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using cummean with group_by and ignoring NAs

Tags:

r

dplyr

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

I'd like to add a new column to the above dataframe which takes the cumulative mean of the value column, not taking into account NAs. Is it possible to do this with dplyr? I've tried

df <- df %>% group_by(category) %>% mutate(new_col=cummean(value))

but cummean just doesn't know what to do with NAs.

EDIT: I do not want to count NAs as 0.

like image 476
John F Avatar asked Sep 01 '25 21:09

John F


2 Answers

You could use ifelse to treat NAs as 0 for the cummean call:

library(dplyr)

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

df %>%
  group_by(category) %>%
  mutate(new_col = cummean(ifelse(is.na(value), 0, value)))

Output:

# A tibble: 8 x 3
# Groups:   category [2]
  category value new_col
  <fct>    <dbl>   <dbl>
1 cat1       NA     0.  
2 cat1        2.    1.00
3 cat2        3.    3.00
4 cat1        4.    2.00
5 cat2        5.    4.00
6 cat2       NA     2.67
7 cat1        7.    3.25
8 cat2        8.    4.00

EDIT: Now I see this isn't the same as ignoring NAs.

Try this one instead. I group by a column which specifies if the value is NA or not, meaning cummean can run without encountering any NAs:

library(dplyr)

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

df %>%
  group_by(category, isna = is.na(value)) %>%
  mutate(new_col = ifelse(isna, NA, cummean(value)))

Output:

# A tibble: 8 x 4
# Groups:   category, isna [4]
  category value isna  new_col
  <fct>    <dbl> <lgl>   <dbl>
1 cat1       NA  TRUE    NA   
2 cat1        2. FALSE    2.00
3 cat2        3. FALSE    3.00
4 cat1        4. FALSE    3.00
5 cat2        5. FALSE    4.00
6 cat2       NA  TRUE    NA   
7 cat1        7. FALSE    4.33
8 cat2        8. FALSE    5.33
like image 132
Jack Brookes Avatar answered Sep 04 '25 11:09

Jack Brookes


An option is to remove value before calculating cummean. In this method rows with NA value will not be accounted for cummean calculation. Not sure if OP wants to consider NA value as 0 in calculation.

df %>% mutate(rn = row_number()) %>%
  filter(!is.na(value)) %>%
  group_by(category) %>%
  mutate(new_col = cummean(value)) %>%
  ungroup() %>% 
  right_join(mutate(df, rn = row_number()), by="rn") %>%
  select(category = category.y, value = value.y, new_col) %>%
  as.data.frame()
#    category value  new_col
# 1     cat1    NA       NA
# 2     cat1     2 2.000000
# 3     cat2     3 3.000000
# 4     cat1     4 3.000000
# 5     cat2     5 4.000000
# 6     cat2    NA       NA
# 7     cat1     7 4.333333
# 8     cat2     8 5.333333
like image 36
MKR Avatar answered Sep 04 '25 12:09

MKR