I have a data frame called main with 400,000 rows, and I want to subset it to retrieve one or more rows.
As an example, here is a small data frame and the kind of subsetting I am doing with the subset function:
main <- data.frame(date = as.POSIXct(c("2015-01-01 07:44:00 GMT","2015-02-02 09:46:00 GMT")),
name= c("bob","george"),
value=c(1,522),
id= c(5,2))
subset(main, date == "2015-01-01 07:44:00" & name == "bob" & value == 1)
This works, but it is slow, and I think that is because I am working with a 400k-row data frame. Any ideas on how to make the subsetting faster?
I'd suggest using a keyed data.table. Here is how to set that up (for a modified example):
require(data.table)
mainDT <- data.table(main)   # copy main into a data.table
setkey(mainDT,V1,V2,V3)      # sort by V1, V2, V3 and mark them as the key
We can now subset based on equality conditions using syntax like
mainDT[J("a","A")]
or
mainDT[J(c("a","b"),"A",1)]
which subsets to rows where V1 %in% c("a","b") (equivalent to V1=="a"|V1=="b"), V2=="A" and V3==1.
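To make the J() syntax concrete, here is a minimal sketch on a tiny table. The table toy and its values are made up for illustration and are not the benchmark data; the sketch only checks that a keyed lookup returns the same rows as an ordinary logical subset.

```r
library(data.table)

# Toy data (hypothetical values, not the benchmark data)
toy <- data.table(
  V1 = c("a", "a", "b", "b", "c"),
  V2 = c("A", "B", "A", "A", "A"),
  V3 = c(1L, 1L, 1L, 2L, 1L),
  V4 = c(0.1, 0.2, 0.3, 0.4, 0.5)
)
setkey(toy, V1, V2, V3)  # sorts by V1, V2, V3 and marks them as the key

# Keyed lookup on the first two key columns
keyed <- toy[J("a", "A")]

# The same rows via an ordinary logical subset
scanned <- toy[V1 == "a" & V2 == "A"]

# Several values for V1: J() recycles "A" and 1L against c("a", "b")
multi <- toy[J(c("a", "b"), "A", 1L)]
```

Here keyed and scanned should contain the same single row, and multi should contain one row for "a" and one for "b".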
Here is a speed comparison:
require(rbenchmark)
benchmark(
"[" = main[main$V1=="a" & main$V2=="A",],
"subset" = subset(main,V1=="a" & V2=="A"),
"DT[J()]" = mainDT[J("a","A")],
replications=5
)[,1:6]
which gives these results on my computer:
     test replications elapsed relative user.self sys.self
1       [            5    5.96       NA      5.38     0.57
3 DT[J()]            5    0.00       NA      0.00     0.00
2  subset            5    6.93       NA      6.20     0.72
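So, subsetting with J is effectively instant, while the other two methods take several seconds. Subsetting with J in this way is limited, however: it handles equality conditions only, and values must be supplied for the key columns in order, starting from the first. You can still skip a key column, for example to get V1=="a" & V3 == 2, by supplying all of its values using mainDT[J("a",unique(V2),2)]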
and it's still quite fast.
Everything you can do with a data.frame can also be done with a data.table. For example, subset(mainDT,V1=="a" & V2=="A") still works. So there is nothing lost by switching your data.frames to data.tables, generally. You can convert to a data.table in place with setDT(main).
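As a rough sketch of the last two points, the snippet below converts a small data frame in place with setDT, uses the unique(V2) trick to subset on V1 and V3 only, and confirms that subset() still gives the same rows. The data frame df and its values are hypothetical, and nomatch = NULL is my addition to drop key combinations that do not occur in the data (it assumes a reasonably recent data.table version).

```r
library(data.table)

# Toy data frame (hypothetical values)
df <- data.frame(
  V1 = c("a", "a", "a", "b"),
  V2 = c("A", "B", "C", "A"),
  V3 = c(2L, 2L, 1L, 2L),
  V4 = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

setDT(df)               # converts by reference; df is now a data.table
setkey(df, V1, V2, V3)

# V1 == "a" & V3 == 2, supplying every value of V2 so the key can be used.
# nomatch = NULL drops combinations that are not present in the data.
via_key <- df[J("a", unique(V2), 2L), nomatch = NULL]

# data.frame-style code still works on a data.table
via_subset <- subset(df, V1 == "a" & V3 == 2L)
```

Without nomatch = NULL, the keyed lookup would also return a row of NA for each requested combination that has no match.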
Here is the code for the example:
n = 1e7    # number of rows
n3 = 1e3   # number of distinct values in V3
set.seed(1)
main <- data.frame(
  V1 = sample(letters, n, replace = TRUE),
  V2 = sample(c(letters, LETTERS), n, replace = TRUE),
  V3 = sample(1:n3, n, replace = TRUE),
  V4 = rnorm(n))
The improvement seen in the benchmark above will vary with your data. When you have many observations (n) or few unique values for the keys (e.g., n3), the benefit of subsetting with a keyed data.table should be greater.
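Applied to the data frame from the question, the same approach might look like the sketch below. Keying on date, name and value is my guess at which columns you filter on, and I have set tz = "GMT" explicitly on both sides so the lookup value matches the stored times; adjust both to your real data.

```r
library(data.table)

# The question's example data, with the time zone made explicit (an assumption)
main <- data.frame(
  date  = as.POSIXct(c("2015-01-01 07:44:00", "2015-02-02 09:46:00"), tz = "GMT"),
  name  = c("bob", "george"),
  value = c(1, 522),
  id    = c(5, 2),
  stringsAsFactors = FALSE
)

mainDT <- as.data.table(main)
setkey(mainDT, date, name, value)  # key on the columns used for filtering

# Keyed equivalent of
# subset(main, date == "2015-01-01 07:44:00" & name == "bob" & value == 1)
hit <- mainDT[J(as.POSIXct("2015-01-01 07:44:00", tz = "GMT"), "bob", 1)]
```

The lookup value for date has to be a POSIXct rather than a character string, since the key lookup matches on the stored values.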