Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding position of the first TRUE of a series of `n` TRUEs

From a vector of TRUE/FALSE

set.seed(1)
x = rnorm(1503501) > 0

I am seeking for a performant (fast) method for getting the position of the first TRUE of the first series of n TRUEs.

The vectors (x) I am dealing with contain exactly 1503501 elements (with the exception of some of them that are much shorter). Below is my current solution. It uses for loop but for loops are extremely slow in R. Are there nicer and especially faster solutions?

n = 20

count = 0
solution = -1
for (i in 1:length(x)){
    if (x[i]){
        count = count + 1
        if (count == n){solution = i+1-n; break}
    } else {count = 0}
}
print(solution)
1182796

I was thinking about using vectorized functions and doing something like y = which(x) or eventually y = paste(which(x)) and seek for particular pattern but I am not sure how to do that.

like image 861
Remi.b Avatar asked Dec 12 '25 13:12

Remi.b


2 Answers

You can use Rcpp:

library(Rcpp)
cppFunction('int fC(LogicalVector x, int n) {
  int xs = x.size();
  int count = 0;
  int solution = -1;
  for (int i = 0; i < xs; ++i) {
    if (x[i]){
      if (++count == n){solution = i+2-n; break;}
    } else {
      count = 0;
    }
  }
  return solution;
}')

Here is a small benchmarking study:

f1 <- function(x,n) {
  count = 0
  solution = -1
  for (i in 1:length(x)){
    if (x[i]){
      count = count + 1
      if (count == n){solution = i+1-n; break}
    } else {count = 0}
  }
  solution
}


set.seed(1)
x = rnorm(150350100) > 0
n = 20

print(f1(x,n)==fC(x,n))
# [1] TRUE


library(rbenchmark)
benchmark(f1(x,n),fC(x,n))
#       test replications elapsed relative user.self sys.self user.child sys.child
# 1 f1(x, n)          100  80.038  180.673    63.300   16.686          0         0
# 2 fC(x, n)          100   0.443    1.000     0.442    0.000          0         0

[Updated benchmark]

# Suggested by BondedDust
tpos <- function(x,pos) { rl <- rle(x); len <- rl$lengths; 
                          sum(len[ 1:(which( len == pos & rl$values==TRUE)[1]-1)],1)}

set.seed(1)
x = rnorm(1503501) > 0
n = 20

print(f1(x,n)==fC(x,n))
# [1] TRUE
print(f1(x,n)==tpos(x,n))
# [1] TRUE


benchmark(f1(x,n),fC(x,n),tpos(x,n),replications = 10)
#         test replications elapsed relative user.self sys.self user.child sys.child
# 1   f1(x, n)           10   4.756  110.605     4.735    0.020          0         0
# 2   fC(x, n)           10   0.043    1.000     0.043    0.000          0         0
# 3 tpos(x, n)           10   2.591   60.256     2.376    0.205          0         0
like image 62
Marat Talipov Avatar answered Dec 14 '25 05:12

Marat Talipov


Take a look at this transcript (using only a much smaller random sample). I think it fairly clear that it will be easy to write a function that picks out the first position that satisfies the joint condition and use cumsum on the lengths up to that point:

> x = rnorm(1500) > 0

> rle(x)
Run Length Encoding
  lengths: int [1:751] 1 1 1 2 1 3 1 2 2 1 ...
  values : logi [1:751] FALSE TRUE FALSE TRUE FALSE TRUE ...
> table( rle(x)$lengths )

  1   2   3   4   5   6   7   8   9 
368 193  94  46  33  10   2   4   1 
> table( rle(x)$lengths , rle(x)$values)

    FALSE TRUE
  1   175  193
  2   100   93
  3    47   47
  4    23   23
  5    21   12
  6     5    5
  7     2    0
  8     3    1
  9     0    1
> which( rle(x)$lengths==8 & rle(x)$values==TRUE)
[1] 542
> which( rle(x)$lengths==7 & rle(x)$values==TRUE)
integer(0)
> which( rle(x)$lengths==6 & rle(x)$values==TRUE)
[1]  12 484 510 720 744

This is my candidate function:

 tpos <- function(x,pos) { rl <- rle(x); len <- rl$lengths; 
            sum(len[ 1:(which( len == pos & rl$values==TRUE)[1]-1)],1)}
 tpos(x,6)
#[1] 18

Note that I subtracted one from the first index so the length of the first qualifying run of TRUE's would not be added in, and then added one to that sum so that the position of the first such TRUE would be calculated. I'm guessing the positon of the first run of n-TRUEs will be distributed as one of the extreme value distributions (although it's not always monotonic increase)

>  tpos(x,8)
[1] 1045
> tpos(x,8)
[1] 1045
> tpos(x,9)
[1] 1417
> tpos(x,10)
[1] 4806
> tpos(x,11)
[1] 2845
> tpos(x,12)
Error in 1:(which(len == pos & rl$values == TRUE)[1] - 1) : 
  NA/NaN argument
> set.seed(1)
> x = rnorm(30000) > 0
> tpos(x,12)
[1] 23509
like image 37
IRTFM Avatar answered Dec 14 '25 03:12

IRTFM



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!