I often want to find the unique combinations of some grouping variables in a data table. With R + dplyr, my normal workflow is to combine group_by(data, var1, var2, var3) %>% summarise(), which returns a new table with the columns var1, var2, var3 and one row for each unique combination of values found in data.
What's the idiomatic way to do this in DataFrames.jl?
In DataFrames.jl, a DataFrame is a collection of rows. So the right mental model here is to first select only the columns you care about, then get the unique rows from that table, as in
select(data, [:var1, :var2, :var3]) |> unique!
Or, if you hate the pipe or love extra parens:
unique!(select(data, [:var1, :var2, :var3]))
unique! is recommended here because select already makes a copy of the underlying columns, so mutating the result in place is safe and avoids a second copy. Alternatively, you could use indexing or a view, neither of which copies the columns; these require unique (which does not mutate the underlying column vectors) so as not to corrupt the original data frame:
unique(data[!, [:var1, :var2, :var3]])
unique(view(data, :, [:var1, :var2, :var3]))
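To make the copy-versus-share distinction concrete, here is a minimal sketch on a made-up data frame. The column names follow the question; the values and the extra value column are assumptions for illustration only.

```julia
using DataFrames

# Hypothetical toy data: var1..var3 are the grouping columns from the question,
# `value` is an extra column that should not influence the result.
data = DataFrame(var1  = [1, 1, 2, 2, 1],
                 var2  = ["a", "a", "b", "b", "a"],
                 var3  = [true, true, false, true, true],
                 value = [10, 20, 30, 40, 50])

cols = [:var1, :var2, :var3]

copied = select(data, cols)   # fresh column vectors (select copies by default)
shared = data[!, cols]        # new DataFrame that reuses data's column vectors

# Identity checks: a copy is a different vector, `!` indexing reuses the same one.
copied.var1 === data.var1     # false
shared.var1 === data.var1     # true

# Safe: mutates only the copied columns.
combos_copy = unique!(copied)

# Safe: unique allocates a new data frame and leaves the shared vectors alone.
combos_shared = unique(shared)
combos_view   = unique(view(data, :, cols))
```

All three results should hold the same three distinct combinations, and data keeps its five rows. Calling unique! on shared instead would shorten vectors that data still refers to, which is the corruption mentioned above.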
Alternatively you can write:
keys(groupby(data, [:var1, :var2, :var3]))
to get a vector of the unique grouping keys. If you want, you can then collect them into a DataFrame by writing:
groupby(data, [:var1, :var2, :var3]) |> keys |> DataFrame
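As a rough sketch of the groupby route, again on made-up data (the values below are assumptions, not from the question):

```julia
using DataFrames

# Hypothetical toy data with three distinct (var1, var2, var3) combinations.
data = DataFrame(var1  = [1, 1, 2, 2, 1],
                 var2  = ["a", "a", "b", "b", "a"],
                 var3  = [true, true, false, true, true],
                 value = [10, 20, 30, 40, 50])

gdf = groupby(data, [:var1, :var2, :var3])

# One key per group; each key exposes the grouping columns as properties.
ks = keys(gdf)
n_groups = length(ks)
key_values = [(k.var1, k.var2, k.var3) for k in ks]

# Collect the keys into a data frame containing only the grouping columns.
combos = DataFrame(ks)
```

This version is handy when you need the grouped data frame anyway, since each key can also be used to index back into it (gdf[ks[1]] returns the rows of the first group).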