I am grouping by an identifier (the identifier may have multiple rows) and then attempting to make a unique row per identifier by using a group_by
+ summarize
with a case_when
. The idea is if there are 2 or more rows then return a single preset value otherwise return the original value. The problem is that case_when returns duplicate rows with the preset value. How do I make sure the output is unique? I could use a distinct
but my expectation would be that the value is already unique. dplyr v. 1.0.7
library(tidyverse)
dat <- tibble(
id = c(1, 1, 2, 3, 4, 4, 5),
fav_color = c('red', 'blue', 'red', 'green', 'white', 'black', 'pink')
)
dat %>%
group_by(id) %>%
summarize(
fav_color = case_when(
n() > 1 ~ 'Multiple Colors',
n() == 1 ~ fav_color
)
)
Returns
# A tibble: 7 x 2
# Groups: id [5]
id fav_color
<dbl> <chr>
1 1 Multiple Colors
2 1 Multiple Colors
3 2 red
4 3 green
5 4 Multiple Colors
6 4 Multiple Colors
7 5 pink
What I want ifelse
gives:
dat %>%
group_by(id) %>%
summarize(
fav_color = ifelse(n() > 1, 'Multiple Colors', fav_color)
)
# A tibble: 5 x 2
id fav_color
<dbl> <chr>
1 1 Multiple Colors
2 2 red
3 3 green
4 4 Multiple Colors
5 5 pink
Note Using a character output for the final return results in the correct number of rows:
dat %>%
group_by(id) %>%
summarize(
fav_color = case_when(
n() > 1 ~ 'Multiple Colors',
n() == 1 ~ 'Single Color'
)
)
# A tibble: 5 x 2
id fav_color
<dbl> <chr>
1 1 Multiple Colors
2 2 Single Color
3 3 Single Color
4 4 Multiple Colors
5 5 Single Color
The issue with case_when
is that it evaluates all the outputs for all the values even when the condition is not satisfied.
Consider this example -
library(dplyr)
a <- c(1, 2, 3, 4)
case_when(a > 0 ~ a * 2,
a < 0 ~ 'a' * 2)
Error in "a" * 2 : non-numeric argument to binary operator
So in your case n() == 1 ~ fav_color
is executed for every group. Since the groups which have more than one row will have more than one value of fav_color
it maintains the number of rows in the output as the original dataframe. The solution is to use a summarising function as @theQi suggested that will always return an output of length 1.
dat %>%
group_by(id) %>%
summarize(
fav_color1 = case_when(
n() > 1 ~ 'Multiple Colors',
n() == 1 ~ fav_color[1]
)
)
# A tibble: 5 x 2
# id fav_color1
# <dbl> <chr>
#1 1 Multiple Colors
#2 2 red
#3 3 green
#4 4 Multiple Colors
#5 5 pink
Not relevant to the question but something worth knowing is fcase
from data.table
is different. It evaluates the output only when it is needed.
a <- c(1, 2, 3, 4)
data.table::fcase(a > 0, a * 2,
a < 0, 'a' * 2)
#[1] 2 4 6 8
case_when
is not designed to deal with this, according to a Tidyverse developer (Github issue #5730).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With