Using cummean with group_by and ignoring NAs

Question

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

I'd like to add a new column to the above dataframe which takes the cumulative mean of the value column, not taking into account NAs. Is it possible to do this with dplyr? I've tried

df <- df %>% group_by(category) %>% mutate(new_col=cummean(value))

but cummean just doesn't know what to do with NAs.

EDIT: I do not want to count NAs as 0.

Jack Brookes · Accepted Answer

You could use ifelse to treat NAs as 0 for the cummean call:

library(dplyr)

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

df %>%
  group_by(category) %>%
  mutate(new_col = cummean(ifelse(is.na(value), 0, value)))

Output:

# A tibble: 8 x 3
# Groups:   category [2]
  category value new_col
  <fct>    <dbl>   <dbl>
1 cat1       NA     0.  
2 cat1        2.    1.00
3 cat2        3.    3.00
4 cat1        4.    2.00
5 cat2        5.    4.00
6 cat2       NA     2.67
7 cat1        7.    3.25
8 cat2        8.    4.00

EDIT: Now I see this isn't the same as ignoring NAs.

Try this one instead. I group by a column which specifies if the value is NA or not, meaning cummean can run without encountering any NAs:

library(dplyr)

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

df %>%
  group_by(category, isna = is.na(value)) %>%
  mutate(new_col = ifelse(isna, NA, cummean(value)))

Output:

# A tibble: 8 x 4
# Groups:   category, isna [4]
  category value isna  new_col
  <fct>    <dbl> <lgl>   <dbl>
1 cat1       NA  TRUE    NA   
2 cat1        2. FALSE    2.00
3 cat2        3. FALSE    3.00
4 cat1        4. FALSE    3.00
5 cat2        5. FALSE    4.00
6 cat2       NA  TRUE    NA   
7 cat1        7. FALSE    4.33
8 cat2        8. FALSE    5.33

MKR · Answer

An option is to remove value before calculating cummean. In this method rows with NA value will not be accounted for cummean calculation. Not sure if OP wants to consider NA value as 0 in calculation.

df %>% mutate(rn = row_number()) %>%
  filter(!is.na(value)) %>%
  group_by(category) %>%
  mutate(new_col = cummean(value)) %>%
  ungroup() %>% 
  right_join(mutate(df, rn = row_number()), by="rn") %>%
  select(category = category.y, value = value.y, new_col) %>%
  as.data.frame()
#    category value  new_col
# 1     cat1    NA       NA
# 2     cat1     2 2.000000
# 3     cat2     3 3.000000
# 4     cat1     4 3.000000
# 5     cat2     5 4.000000
# 6     cat2    NA       NA
# 7     cat1     7 4.333333
# 8     cat2     8 5.333333

Using cummean with group_by and ignoring NAs

Tags:

r

dplyr

John F

2 Answers

Jack Brookes

MKR

Recent Activity

Donate For Us

Using cummean with group_by and ignoring NAs

Tags:

r

dplyr

John F

2 Answers

Jack Brookes

MKR

Related questions

Recent Activity

Donate For Us