I have a data frame similar to the following, but with 55,000 observations and about 50,000 groups:
d <- structure(list(a = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), b = c(1, 1, 2, 1, 2, 1, 2)), .Names = c("a",
"b"), row.names = c(1L, 3L, 2L, 4L, 5L, 6L, 7L), class = "data.frame")
As in this example data frame, each group is ordered by variable "b". I would now like to split the data frame according to the grouping variable "a" and add a vector that indicates, for each element, its position within its sub-data-frame. The result should look like this:
structure(list(a = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), b = c(1, 2, 1, 1, 2, 1, 2), order = c(1,
2, 3, 1, 2, 1, 2)), .Names = c("a", "b", "order"), row.names = c("1",
"2", "3", "4", "5", "6", "7"), class = "data.frame")
I was able to get this result with the split() function and my own gmark() function on the test data frame like this (gmark() assumes the input is already sorted):
gmark <- function(input){
  # walk over the input and build the running index 1, 2, ..., length(input)
  x <- 0
  result <- vector()
  for(i in input){
    x <- x + 1
    result <- append(result, x)
  }
  result
}
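For reference, gmark() just rebuilds the sequence 1, 2, ..., length(input) one element at a time, so it should behave like seq_along(); a quick check on the test data:
> all(gmark(d$b) == seq_along(d$b))
[1] TRUE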
x <- split(d, d$a)
x <- lapply(x, function(x){cbind(x, order = gmark(x$b))})
x <- unsplit(x, d$a)
However, once I apply this to the bigger data frame, split() gets very slow and does not return. Is there a way to get this result more efficiently on a bigger data frame?
Here's a solution using the data.table package; it will be much faster.
require(data.table)
DT <- as.data.table(d)
DT[, order := 1:.N, by=a]
> DT
   a b order
1: A 1     1
2: A 2     2
3: A 1     3
4: B 1     1
5: B 2     2
6: C 1     1
7: C 2     2
:= is a data.table operator that adds columns by reference (meaning no copy of your data is made), and .N is a special variable that holds the number of rows in each group (here it will be 3, 2 and 2, corresponding to groups A, B and C).
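To see what .N holds, the group sizes can be listed on their own (using the same DT as above):
> DT[, .N, by = a]
   a N
1: A 3
2: B 2
3: C 2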
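For comparison, the same order column can also be added in base R without splitting at all; a sketch using ave() (assuming, as in the question, that rows are already sorted within each group):
d$order <- ave(d$b, d$a, FUN = seq_along)
ave() applies seq_along() to b within each level of a and returns the results in the original row order, so no unsplit() step is needed.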