Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find values in a given interval without a vector scan

With a the R package data.table is it possible to find the values that are in a given interval without a full vector scan of the data. For example

>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
   x
1: 3
2: 5
3: 8

Where DT can be a very big table.

Bonus question: is it possible to do the same thing for a set of non-overlapping intervals such as

>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
   i min max
1: 1   3  10
2: 2  20  40
> my.data.table.function2(DT,I)
   i  x
1: 1  3
2: 1  5
3: 1  8
4: 2 21
5: 2 34

Where both I and DT can be very big. Thanks a lot

like image 345
user2147028 Avatar asked Dec 06 '25 04:12

user2147028


2 Answers

Here is a variation of the code proposed by @user1935457 (see comment in @user1935457 post)

system.time({

 if(!identical(key(DT), "x")) setkey(DT, x)
 setkey(IT, min)

 #below is the line that differs from @user1935457 
 #Using IT to address the lines of DT creates a smaller intermediate table
 #We can also directly use .I 
 target.low<-DT[IT,list(i=i,min=.I),roll=-Inf, nomatch = 0][,list(min=min[1]),keyby=i]
 setattr(IT, "sorted", "max")

 # same here
 target.high<-DT[IT,list(i=i,max=.I),roll=Inf, nomatch = 0][,list(max=last(max)),keyby=i]
 target <- target.low[target.high, nomatch = 0]
 target[, len := max - min + 1L]

 rm(target.low, target.high)
 ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
 setcolorder(ans.roll2, c("i", "x"))
})
#    user  system elapsed 
#    0.07    0.00    0.06 


system.time({ 
 # @user1935457 code
 })
#    user  system elapsed 
#    0.08    0.00    0.08 

identical(ans.roll2, ans.roll)
#[1] TRUE

The performance gain is not huge here, but it shall be more sensitive with larger DT and smaller IT. thanks again to @user1935457 for your answer.

like image 83
user2147028 Avatar answered Dec 07 '25 18:12

user2147028


First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.

That out of the way, we need a bigger example to exhibit difference from vector scan approach:

require(data.table)

n <- 1e5L
f <- 10L
ni <- n / f

set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni, 
                 min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
                 max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))

DT, the Data Table is a not-too-random subset of 1:n. IT, the Interval Table is ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:

system.time({
  ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
 ##  user  system elapsed 
 ## 84.15    4.48   88.78

One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:

system.time({
  # Save time if DT is already keyed correctly
  if(!identical(key(DT), "x")) setkey(DT, x)

  DT[, row := .I]

  setkey(IT, min)

  target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]

  # Non-overlapping intervals => (sorted by min => sorted by max)
  setattr(IT, "sorted", "max")

  target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]

  target <- target.low[target.high, nomatch = 0]
  target[, len := max - min + 1L]


  rm(target.low, target.high)

  ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
  ans.roll[, row := NULL]
  setcolorder(ans.roll, c("i", "x"))
})
 ## user  system elapsed 
 ## 0.12    0.00    0.12

Ensuring the same row order verifies the result:

setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
like image 39
user1935457 Avatar answered Dec 07 '25 19:12

user1935457



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!