Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Collapsing data case by case in R

I'm working with a data set that contains information on anti-human trafficking organizations. The organizations are identified by either the organization names or the web address of the organization's home page. I'd like to conditionally collapse this data frame on a case-by-case basis so that I'm left with a unique set of identifiers (in the case of my data, either the name of an organization or the organization's web address) for each case along with about 1000+ numeric attributes for these cases that are either the highest or lowest value of however many rows the identifier was associated with before the collapse. To exemplify this, I want to turn:

> df1
x      y     z
Item1  0     3
Item1  1     4
Item2  1     2
Item3  1     3
Item2  1     5
Item3  1     2
Item4  0     2

Into something like

> df2
x     y      z
Item1  1     3
Item2  1     2
Item3  1     2
Item4  0     2

In this example, of course, I want to keep the max for Var2 and the min for Var3 and preserve only unique Var1 values.

Can anyone suggest a systematic way to do this for a large data set? Thanks in advance for your help!

like image 999
Nina Avatar asked Jan 31 '26 18:01

Nina


1 Answers

One option is to use the plyr package:

library(plyr)
ddply(df, .(x), summarize, y=max(y), z=min(z))
      x y z
1 Item1 1 3
2 Item2 1 2
3 Item3 1 2
4 Item4 0 2

Alternatively, and just about as simple, is the package data.table. This option is likely to substantially faster if your data is really large.

library(data.table)
data.table(df)[, list(y=max(y), z=min(z)), by=x]
       x y z
1: Item1 1 3
2: Item2 1 2
3: Item3 1 2
4: Item4 0 2
like image 194
Andrie Avatar answered Feb 02 '26 10:02

Andrie