I have a dataframe which I want to subset depending on whether a value in a certain column is among the top 5 values of all the columns in that row. This is a simplified version of my dataframe:
> my.df <- data.frame(a = rnorm(10,5), b= rnorm(10,5), c=rnorm(10,5), d=rnorm(10,5), e=rnorm(10,5))
> my.df
a b c d e
1 6.401462 5.318849 5.373496 5.101140 3.710973
2 6.715845 4.786936 3.521965 4.264029 4.525138
3 6.076211 5.356114 5.605134 5.443002 5.296778
4 7.009623 5.275595 4.801874 4.355892 6.752737
5 5.002059 6.163398 6.063694 2.409702 6.172111
6 6.298305 3.291884 5.737053 4.701320 4.752406
7 4.856246 4.674743 5.550828 7.501786 5.466611
8 5.037990 4.129333 4.797334 5.143915 5.558161
9 4.903592 3.135622 5.879798 5.639893 4.368915
10 5.500374 4.400130 3.980433 6.203259 4.498614
Now I want only the rows where the values of column a or column b are among the top 2 values in their row. So in this example deleting row 7-9, giving:
a b c d e
1 6.401462 5.318849 5.373496 5.101140 3.710973
2 6.715845 4.786936 3.521965 4.264029 4.525138
3 6.076211 5.356114 5.605134 5.443002 5.296778
4 7.009623 5.275595 4.801874 4.355892 6.752737
5 5.002059 6.163398 6.063694 2.409702 6.172111
6 6.298305 3.291884 5.737053 4.701320 4.752406
10 5.500374 4.400130 3.980433 6.203259 4.498614
Any ideas?
We can loop through the rows with apply (from base R) check whether any of the elements in 'a' or 'b' are %in% the sorted group to create a logical index and subset the rows based on that
i1 <- apply(my.df, 1, function(x) any(x[1:2] %in% sort(x, decreasing = TRUE)[1:2]))
my.df[i1,]
# a b c d e
#1 6.401462 5.318849 5.373496 5.101140 3.710973
#2 6.715845 4.786936 3.521965 4.264029 4.525138
#3 6.076211 5.356114 5.605134 5.443002 5.296778
#4 7.009623 5.275595 4.801874 4.355892 6.752737
#5 5.002059 6.163398 6.063694 2.409702 6.172111
#6 6.298305 3.291884 5.737053 4.701320 4.752406
#10 5.500374 4.400130 3.980433 6.203259 4.498614
Or use max.col from base R to create the logical index and that would be much faster and avoid any transformation
i1 <- max.col(my.df, "first")
i2 <- max.col(replace(my.df, cbind(seq_len(nrow(my.df)), i1), -Inf), "first")
my.df[(i1 %in% 1:2) | (i2 %in% 1:2), ]
my.df <- structure(list(a = c(6.401462, 6.715845, 6.076211, 7.009623,
5.002059, 6.298305, 4.856246, 5.03799, 4.903592, 5.500374), b = c(5.318849,
4.786936, 5.356114, 5.275595, 6.163398, 3.291884, 4.674743, 4.129333,
3.135622, 4.40013), c = c(5.373496, 3.521965, 5.605134, 4.801874,
6.063694, 5.737053, 5.550828, 4.797334, 5.879798, 3.980433),
d = c(5.10114, 4.264029, 5.443002, 4.355892, 2.409702, 4.70132,
7.501786, 5.143915, 5.639893, 6.203259), e = c(3.710973,
4.525138, 5.296778, 6.752737, 6.172111, 4.752406, 5.466611,
5.558161, 4.368915, 4.498614)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With