I'm working with a data set that contains information on anti-human-trafficking organizations. Each organization is identified by either its name or the web address of its home page. I'd like to conditionally collapse this data frame on a case-by-case basis so that I'm left with one row per unique identifier (in my data, either an organization's name or its web address), along with roughly 1000+ numeric attributes per case, each of which should be either the highest or the lowest value across however many rows that identifier was associated with before the collapse. To illustrate, I want to turn:
> df1
x y z
Item1 0 3
Item1 1 4
Item2 1 2
Item3 1 3
Item2 1 5
Item3 1 2
Item4 0 2
Into something like
> df2
x y z
Item1 1 3
Item2 1 2
Item3 1 2
Item4 0 2
In this example, of course, I want to keep the max of y and the min of z, and preserve only unique x values.
Can anyone suggest a systematic way to do this for a large data set? Thanks in advance for your help!
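For reference, here is a minimal sketch that reconstructs the example data frame above (the name df1 and the column types are assumptions for illustration):
# Reconstruct the example data from the question (assumed structure)
df1 <- data.frame(
  x = c("Item1", "Item1", "Item2", "Item3", "Item2", "Item3", "Item4"),
  y = c(0, 1, 1, 1, 1, 1, 0),
  z = c(3, 4, 2, 3, 5, 2, 2),
  stringsAsFactors = FALSE
)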
One option is to use the plyr package:
library(plyr)
ddply(df1, .(x), summarize, y=max(y), z=min(z))
x y z
1 Item1 1 3
2 Item2 1 2
3 Item3 1 2
4 Item4 0 2
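With 1000+ numeric columns you won't want to spell out each summary by hand. One way to scale the plyr approach is to pass ddply a function that applies max or min over whole sets of columns; the vectors max_cols and min_cols below are placeholders you would fill with your actual column names:
library(plyr)
# Placeholder column sets -- replace with your real column names
max_cols <- c("y")   # columns to collapse with max
min_cols <- c("z")   # columns to collapse with min
collapse_one <- function(d) {
  # For one group, take the max of some columns and the min of others,
  # returning a one-row data frame; ddply prepends the grouping column x.
  cbind(
    as.data.frame(lapply(d[max_cols], max)),
    as.data.frame(lapply(d[min_cols], min))
  )
}
ddply(df1, .(x), collapse_one)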
Alternatively, and just about as simple, is the data.table package. This option is likely to be substantially faster if your data is really large.
library(data.table)
data.table(df1)[, list(y=max(y), z=min(z)), by=x]
x y z
1: Item1 1 3
2: Item2 1 2
3: Item3 1 2
4: Item4 0 2
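The data.table version scales to many columns the same way via .SD; again, max_cols and min_cols are assumed vectors of your real column names:
library(data.table)
dt <- as.data.table(df1)
max_cols <- c("y")   # columns to collapse with max
min_cols <- c("z")   # columns to collapse with min
# Within each group, apply max over one set of columns and min over the other
dt[, c(lapply(.SD[, max_cols, with = FALSE], max),
       lapply(.SD[, min_cols, with = FALSE], min)),
   by = x]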