I have data that looks like this:
score temp
1 a.score 0.05502011
2 b.score 0.02484594
3 c.score -0.07183767
4 d.score -0.06932274
5 e.score -0.15512460
I want to sort the names based on the values, from most negative to most positive, and take the top 4. I tried:
> topfour.values <- apply(temp.df, 2, function(xx)head(sort(xx), 4, na.rm = TRUE, decreasing = FALSE))
> topfour.names <- apply(temp.df, 2, function(xx)head(names(sort(xx)), 4, na.rm = TRUE))
> topfour <- rbind(topfour.names, topfour.values)
and I get
> topfour.values
temp[, 1]
d.score "-0.06932274"
c.score "-0.0718376680"
e.score "-0.1551246"
b.score " 0.02484594"
What order is this? What did I do wrong and how do I get it sorted properly?
I've tried method == "Quick" and method == "Shell" as options, but the order still doesn't make sense.
I believe your data has the wrong type. It would help to know how you are reading your data into R. In the example above you are sorting a character vector, not a numeric one. With a numeric temp column, order works as expected:
head(with(df, df[order(temp), ]), 4)
score temp
5 e.score -0.15512460
3 c.score -0.07183767
4 d.score -0.06932274
2 b.score 0.02484594
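As a minimal sketch of the type problem described above, the following assumes the values arrived as text. The vector names `temp_chr` and `temp_num` are hypothetical and only used for illustration; the values are copied from the question.

```r
# Values from the question, stored as text (an assumption about how they were read in)
temp_chr <- c("0.05502011", "0.02484594", "-0.07183767",
              "-0.06932274", "-0.15512460")

# Check the type before sorting: character vectors sort as strings, not numbers
is.character(temp_chr)

# Convert to numeric first; for a factor, use as.numeric(as.character(f)) instead
temp_num <- as.numeric(temp_chr)

# order() now ranks by numeric value, most negative first
idx <- order(temp_num)
temp_num[idx][1:4]
```

Running `str()` on the data frame right after reading it in is a quick way to spot a column that is character or factor when it should be numeric.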
Taking the approach proposed by Greg Snow, and given that you are only interested in the vector of top values and that the partial argument cannot be used in this case, a simple speed test comparing order and sort.list shows that the differences may be negligible, even for a vector of size 1e7.
df1 <- data.frame(temp = rnorm(1e+7),
                  score = sample(letters, 1e+7, replace = TRUE))
library(microbenchmark)
microbenchmark(
head(with(df1, df1[order(temp), 1]), 4),
head(with(df1, df1[sort.list(temp), 1]), 4),
head(df1[order(df1$temp), 1], 4),
head(df1[sort.list(df1$temp), 1], 4),
times = 1L
)
Unit: seconds
                                        expr      min       lq   median       uq      max neval
     head(with(df1, df1[order(temp), 1]), 4) 13.42581 13.42581 13.42581 13.42581 13.42581     1
 head(with(df1, df1[sort.list(temp), 1]), 4) 13.80256 13.80256 13.80256 13.80256 13.80256     1
           head(df1[order(df1$temp), 1], 4) 13.88580 13.88580 13.88580 13.88580 13.88580     1
       head(df1[sort.list(df1$temp), 1], 4) 13.13579 13.13579 13.13579 13.13579 13.13579     1
There are several problems, some of which have been discussed in the comments, but one big one that I have not seen mentioned yet is that the apply function works on matrices and therefore converts your data frame to a matrix before doing anything else. Since your data has both a factor and a numeric variable, the numbers are converted to character strings and the sorting is done on the character string representation, not the numerical value. Using tools that work directly with data frames (and lists) prevents this, as does using order and avoiding apply altogether.
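The coercion described above can be seen in a small sketch. The data frame below is a reconstruction of the data in the question (the name `temp.df` comes from the question; `m` and `top4` are hypothetical names used for illustration).

```r
# Reconstruction of the question's data: one text column, one numeric column
temp.df <- data.frame(
  score = c("a.score", "b.score", "c.score", "d.score", "e.score"),
  temp  = c(0.05502011, 0.02484594, -0.07183767, -0.06932274, -0.15512460),
  stringsAsFactors = FALSE
)

# apply() converts a data frame with as.matrix() first; with mixed column
# types the result is a character matrix, so the numbers become strings
m <- as.matrix(temp.df)
is.character(m)

# Sorting the data frame directly keeps temp numeric
top4 <- head(temp.df[order(temp.df$temp), ], 4)
top4
```

Because `order` returns row indices, the names and values stay together in one data frame and no `rbind` of separately sorted pieces is needed.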
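Also, if you only want the n largest or smallest values, you may be able to speed things up a little with partial sorting, for example through the partial argument of sort. (sort.list also lists a partial argument, but its documentation notes that non-NULL values are not implemented.)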
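A minimal sketch of partial sorting, assuming only the values (not the names) are needed. It uses `sort` with `partial` rather than `sort.list`, since the R documentation states that non-NULL `partial` values are not implemented for `sort.list`. The names `x`, `n`, and `smallest` are hypothetical.

```r
set.seed(1)
x <- rnorm(1000)   # illustrative data, not the data from the question
n <- 4

# partial = 1:n asks only that positions 1..n hold the values they would
# hold in a full sort; the rest of the vector is left partially ordered
smallest <- sort(x, partial = 1:n)[1:n]
smallest
```

Partial sorting discards names, so this suits cases where just the top values matter. Whether it is measurably faster than a full sort depends on the size of the data and is worth benchmarking.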