My question involves the distinct function from dplyr.
First, set up the data:
library(dplyr)

set.seed(0)
df <- data.frame(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)
Consider the following two uses of distinct.
df %>%
  group_by(x) %>%
  distinct()

df %>%
  group_by(x) %>%
  distinct(y)
The first produces a different result from the second. As far as I can tell, the first finds "all distinct values of x, returning the first value of y for each", whereas the second finds "for each value of x, all distinct values of y".
Why should this be so when
df %>%
  distinct(x, y)

df %>%
  distinct()
produce the same result?
EDIT: It looks like this is a known bug already: https://github.com/hadley/dplyr/issues/1110
As far as I can tell, the answer is that distinct considers grouping columns when determining distinctness, which to me seems inconsistent with how the rest of dplyr works.
Thus:
df %>%
  group_by(x) %>%
  distinct()
This groups by x, then keeps only the rows that are distinct in x (!), so each value of x appears once. This seems to be a bug.
However:
df %>%
  group_by(x) %>%
  distinct(y)
This groups by x, then finds the values that are distinct in y within each x. It is equivalent to either of these calls:
df %>%
  distinct(x, y)

df %>%
  distinct()
Both find distinct values in x and y.
The take-home message seems to be: don't combine grouping with distinct. Instead, pass the relevant column names as arguments to distinct.
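As a minimal sketch of that advice, the snippet below names the columns explicitly and checks the result against base R's unique(), which does not depend on how a given dplyr version treats grouped input. It also shows one way to drop an existing grouping with ungroup() before calling distinct. The variable names by_name, base_r, and ungrouped are only illustrative, and the checks compare row counts rather than exact output.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)

# Name the columns explicitly instead of relying on group_by()
by_name <- df %>% distinct(x, y)

# Base R reference: unique() on a data frame keeps one row per (x, y) pair
base_r <- unique(df)

# Both approaches should keep the same number of rows
stopifnot(nrow(by_name) == nrow(base_r))

# No (x, y) pair is repeated in the result
stopifnot(!any(duplicated(by_name)))

# If the data is already grouped, drop the grouping first
ungrouped <- df %>%
  group_by(x) %>%
  ungroup() %>%
  distinct(x, y)

stopifnot(nrow(ungrouped) == nrow(by_name))
```

If the stopifnot() calls pass silently, the explicit-column form and the ungrouped form agree with base R on how many unique (x, y) pairs there are.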