I have a data.frame with two columns: year and score. The years go from 2000-2012 and each year can be listed multiple times. In the score column I list all the scores for each year with each row having a different score.
What I'd like to do is filter the data.frame so only the rows with the maximum scores for each year remain.
So as a tiny example if I have
year score
2000    18
2001    22
2000    21
I would want to return just
year score
2001    22
2000    21
If you know sql this is easier to understand
library(sqldf)
sqldf('select year, max(score) from mydata group by year')
Update (2016-01): Now you can also use dplyr
library(dplyr)
mydata %>% group_by(year) %>% summarise(max = max(score))
using plyr
require(plyr)
set.seed(45)
df <- data.frame(year=sample(2000:2012, 25, replace=T), score=sample(25))
ddply(df, .(year), summarise, max.score=max(score))
using data.table
require(data.table)
dt <- data.table(df, key="year")
dt[, list(max.score=max(score)), by=year]
using aggregate:
o <- aggregate(df$score, list(df$year) , max)
names(o) <- c("year", "max.score")
using ave:
df1 <- df
df1$max.score <- ave(df1$score, df1$year, FUN=max)
df1 <- df1[!duplicated(df1$year), ]
Edit: In case of more columns, a data.table solution would be the best (my opinion :))
set.seed(45)
df <- data.frame(year=sample(2000:2012, 25, replace=T), score=sample(25), 
               alpha = sample(letters[1:5], 25, replace=T), beta=rnorm(25))
# convert to data.table with key=year
dt <- data.table(df, key="year")
# get the subset of data that matches this criterion
dt[, .SD[score %in% max(score)], by=year]
#     year score alpha       beta
#  1: 2000    20     b  0.8675148
#  2: 2001    21     e  1.5543102
#  3: 2002    22     c  0.6676305
#  4: 2003    18     a -0.9953758
#  5: 2004    23     d  2.1829996
#  6: 2005    25     b -0.9454914
#  7: 2007    17     e  0.7158021
#  8: 2008    12     e  0.6501763
#  9: 2011    24     a  0.7201334
# 10: 2012    19     d  1.2493954
using base packages
> df
  year score
1 2000    18
2 2001    22
3 2000    21
> aggregate(score ~ year, data=df, max)
  year score
1 2000    21
2 2001    22
EDIT
If you have additional columns that you need to keep, then you can user merge with aggregate to get those columns
> df <- data.frame(year = c(2000, 2001, 2000), score = c(18, 22, 21) , hrs = c( 10, 11, 12))
> df
  year score hrs
1 2000    18  10
2 2001    22  11
3 2000    21  12
> merge(aggregate(score ~ year, data=df, max), df, all.x=T)
  year score hrs
1 2000    21  12
2 2001    22  11
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With