Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to do subsetting in R

Tags:

r

I have a data frame called main that has 400,000 rows and I want to subset it to retrieve 1 or more rows.

As an example here is a data frame which shows the kind of subsetting I am using the subset function:

main <- data.frame(date = as.POSIXct(c("2015-01-01 07:44:00 GMT","2015-02-02 09:46:00 GMT")),
                   name= c("bob","george"),
                   value=c(1,522), 
                   id= c(5,2))

subset(main, date == "2015-01-01 07:44:00" & name == "bob" & value == 1)

This works but it is slow and I think it is because I am working with a 400k row data frame. Any ideas how to make subsetting faster?

like image 731
user3022875 Avatar asked Oct 18 '25 15:10

user3022875


1 Answers

I'd suggest using a keyed data.table. Here is how to set that up (for a modified example):

require(data.table)
mainDT <- data.table(main)
setkey(mainDT,V1,V2,V3)

We can now subset based on equality conditions using syntax like

mainDT[J("a","A")]

or

mainDT[J(c("a","b"),"A",1)]

which subsets to where V1 %in% c("a","b") (equivalent to V1=="a"|V1=="b").


Here is a speed comparison:

require(rbenchmark)
benchmark(
  "["       = main[main$V1=="a" & main$V2=="A",],
  "subset"  = subset(main,V1=="a" & V2=="A"),
  "DT[J()]" = mainDT[J("a","A")],
  replications=5
)[,1:6]

which gives these results on my computer:

     test replications elapsed relative user.self sys.self
1       [            5    5.96       NA      5.38     0.57
3 DT[J()]            5    0.00       NA      0.00     0.00
2  subset            5    6.93       NA      6.20     0.72

So, subsetting with J is instant, while the other two methods take several seconds. Subsetting with J in this way is limited, however:

  • It is for equality conditions only.
  • For the simple syntax above, you need to pass arguments in the order of the key. However, you can select where V1=="a" & V3 == 2 using mainDT[J("a",unique(V2),2)] and it's still quite fast.

Everything you can do with a data.frame can also be done with a data.table. For example, subset(mainDT,V1=="a" & V2=="A") still works. So there is nothing lost by switching your data.frames to data.tables, generally. You can convert to a data.table in place with setDT(main).


Here is the code for the example:

n  = 1e7
n3 = 1e3

set.seed(1)
main <- data.frame(
  V1=sample(letters,n,replace=TRUE),
  V2=sample(c(letters,LETTERS),n,replace=TRUE),
  V3=sample(1:n3,n,replace=TRUE),
  V4=rnorm(n))

The improvement seen in the benchmark above will vary with your data. When you have many observations (n) or few unique values for the keys (e.g., n3), the benefit of subsetting with a keyed data.table should be greater.

like image 105
Frank Avatar answered Oct 20 '25 06:10

Frank