Finding position of the first TRUE of a series of `n` TRUEs

Question

From a vector of TRUE/FALSE

set.seed(1)
x = rnorm(1503501) > 0

I am seeking for a performant (fast) method for getting the position of the first TRUE of the first series of n TRUEs.

The vectors (x) I am dealing with contain exactly 1503501 elements (with the exception of some of them that are much shorter). Below is my current solution. It uses for loop but for loops are extremely slow in R. Are there nicer and especially faster solutions?

n = 20

count = 0
solution = -1
for (i in 1:length(x)){
    if (x[i]){
        count = count + 1
        if (count == n){solution = i+1-n; break}
    } else {count = 0}
}
print(solution)
1182796

I was thinking about using vectorized functions and doing something like y = which(x) or eventually y = paste(which(x)) and seek for particular pattern but I am not sure how to do that.

Marat Talipov · Accepted Answer

You can use Rcpp:

library(Rcpp)
cppFunction('int fC(LogicalVector x, int n) {
  int xs = x.size();
  int count = 0;
  int solution = -1;
  for (int i = 0; i < xs; ++i) {
    if (x[i]){
      if (++count == n){solution = i+2-n; break;}
    } else {
      count = 0;
    }
  }
  return solution;
}')

Here is a small benchmarking study:

f1 <- function(x,n) {
  count = 0
  solution = -1
  for (i in 1:length(x)){
    if (x[i]){
      count = count + 1
      if (count == n){solution = i+1-n; break}
    } else {count = 0}
  }
  solution
}


set.seed(1)
x = rnorm(150350100) > 0
n = 20

print(f1(x,n)==fC(x,n))
# [1] TRUE


library(rbenchmark)
benchmark(f1(x,n),fC(x,n))
#       test replications elapsed relative user.self sys.self user.child sys.child
# 1 f1(x, n)          100  80.038  180.673    63.300   16.686          0         0
# 2 fC(x, n)          100   0.443    1.000     0.442    0.000          0         0

[Updated benchmark]

# Suggested by BondedDust
tpos <- function(x,pos) { rl <- rle(x); len <- rl$lengths; 
                          sum(len[ 1:(which( len == pos & rl$values==TRUE)[1]-1)],1)}

set.seed(1)
x = rnorm(1503501) > 0
n = 20

print(f1(x,n)==fC(x,n))
# [1] TRUE
print(f1(x,n)==tpos(x,n))
# [1] TRUE


benchmark(f1(x,n),fC(x,n),tpos(x,n),replications = 10)
#         test replications elapsed relative user.self sys.self user.child sys.child
# 1   f1(x, n)           10   4.756  110.605     4.735    0.020          0         0
# 2   fC(x, n)           10   0.043    1.000     0.043    0.000          0         0
# 3 tpos(x, n)           10   2.591   60.256     2.376    0.205          0         0

IRTFM · Answer

Take a look at this transcript (using only a much smaller random sample). I think it fairly clear that it will be easy to write a function that picks out the first position that satisfies the joint condition and use cumsum on the lengths up to that point:

> x = rnorm(1500) > 0

> rle(x)
Run Length Encoding
  lengths: int [1:751] 1 1 1 2 1 3 1 2 2 1 ...
  values : logi [1:751] FALSE TRUE FALSE TRUE FALSE TRUE ...
> table( rle(x)$lengths )

  1   2   3   4   5   6   7   8   9 
368 193  94  46  33  10   2   4   1 
> table( rle(x)$lengths , rle(x)$values)

    FALSE TRUE
  1   175  193
  2   100   93
  3    47   47
  4    23   23
  5    21   12
  6     5    5
  7     2    0
  8     3    1
  9     0    1
> which( rle(x)$lengths==8 & rle(x)$values==TRUE)
[1] 542
> which( rle(x)$lengths==7 & rle(x)$values==TRUE)
integer(0)
> which( rle(x)$lengths==6 & rle(x)$values==TRUE)
[1]  12 484 510 720 744

This is my candidate function:

 tpos <- function(x,pos) { rl <- rle(x); len <- rl$lengths; 
            sum(len[ 1:(which( len == pos & rl$values==TRUE)[1]-1)],1)}
 tpos(x,6)
#[1] 18

Note that I subtracted one from the first index so the length of the first qualifying run of TRUE's would not be added in, and then added one to that sum so that the position of the first such TRUE would be calculated. I'm guessing the positon of the first run of n-TRUEs will be distributed as one of the extreme value distributions (although it's not always monotonic increase)

>  tpos(x,8)
[1] 1045
> tpos(x,8)
[1] 1045
> tpos(x,9)
[1] 1417
> tpos(x,10)
[1] 4806
> tpos(x,11)
[1] 2845
> tpos(x,12)
Error in 1:(which(len == pos & rl$values == TRUE)[1] - 1) : 
  NA/NaN argument
> set.seed(1)
> x = rnorm(30000) > 0
> tpos(x,12)
[1] 23509

Finding position of the first TRUE of a series of `n` TRUEs

Tags:

performance

r

count

boolean

Remi.b

2 Answers

Marat Talipov

IRTFM

Recent Activity

Donate For Us

Finding position of the first TRUE of a series of `n` TRUEs

Tags:

performance

r

count

boolean

Remi.b

2 Answers

Marat Talipov

IRTFM

Related questions

Recent Activity

Donate For Us