Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to efficiently add an order variable to a big data frame

I have a data frame similar to the following but with 55.000 observations and about 50.000 groups:

d <- structure(list(a = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), b = c(1, 1, 2, 1, 2, 1, 2)), .Names = c("a", 
"b"), row.names = c(1L, 3L, 2L, 4L, 5L, 6L, 7L), class = "data.frame")

As in this data frame each group is again ordered depending on variable "b". I now like to split the data frame according to the grouping variable "a" and add a vector that indicates the ordering number of each element of each sub-data-frame. So the result should look like this:

structure(list(a = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), b = c(1, 2, 1, 1, 2, 1, 2), order = c(1, 
2, 3, 1, 2, 1, 2)), .Names = c("a", "b", "order"), row.names = c("1", 
"2", "3", "4", "5", "6", "7"), class = "data.frame")

I was able to get this result with the split() function an my own gmark() function on the test data frame like this (gmark() assumes the input is already sorted):

gmark <- function(input){
  x = 0
  result = vector()
  for(i in input){
    x <- x+1
    result <- append(result, x)
  }
  result
}

x <- split(d, d$a)
x <- lapply(x, function(x){cbind(x, order = gmark(x$b))})
x <- unsplit(x, a)

However, once I apply this to the bigger data frame split() gets very slow and does not return. Is there a way to get this result on a bigger data frame more efficiently?

like image 242
fr3d-5 Avatar asked Dec 07 '25 11:12

fr3d-5


1 Answers

Here's a solution with data.table package. This'll be much faster.

require(data.table)
DT <- as.data.table(DF)
DT[, order := 1:.N, by=a]
> DT
   a b order
1: A 1     1
2: A 2     2
3: A 1     3
4: B 1     1
5: B 2     2
6: C 1     1
7: C 2     2

:= is a data.table operator, that adds columns by reference (meaning no copy of your data is made). And .N is a special variable that contains the length of each group (here, it'll hold 3,2,2 corresponding to A,B,C).

like image 168
Arun Avatar answered Dec 09 '25 01:12

Arun