From a vector of TRUE/FALSE
set.seed(1)
x = rnorm(1503501) > 0
I am seeking for a performant (fast) method for getting the position of the first TRUE of the first series of n TRUEs.
The vectors (x) I am dealing with contain exactly 1503501 elements (with the exception of some of them that are much shorter). Below is my current solution. It uses for loop but for loops are extremely slow in R. Are there nicer and especially faster solutions?
n = 20
count = 0
solution = -1
for (i in 1:length(x)){
if (x[i]){
count = count + 1
if (count == n){solution = i+1-n; break}
} else {count = 0}
}
print(solution)
1182796
I was thinking about using vectorized functions and doing something like y = which(x) or eventually y = paste(which(x)) and seek for particular pattern but I am not sure how to do that.
You can use Rcpp:
library(Rcpp)
cppFunction('int fC(LogicalVector x, int n) {
int xs = x.size();
int count = 0;
int solution = -1;
for (int i = 0; i < xs; ++i) {
if (x[i]){
if (++count == n){solution = i+2-n; break;}
} else {
count = 0;
}
}
return solution;
}')
Here is a small benchmarking study:
f1 <- function(x,n) {
count = 0
solution = -1
for (i in 1:length(x)){
if (x[i]){
count = count + 1
if (count == n){solution = i+1-n; break}
} else {count = 0}
}
solution
}
set.seed(1)
x = rnorm(150350100) > 0
n = 20
print(f1(x,n)==fC(x,n))
# [1] TRUE
library(rbenchmark)
benchmark(f1(x,n),fC(x,n))
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 f1(x, n) 100 80.038 180.673 63.300 16.686 0 0
# 2 fC(x, n) 100 0.443 1.000 0.442 0.000 0 0
[Updated benchmark]
# Suggested by BondedDust
tpos <- function(x,pos) { rl <- rle(x); len <- rl$lengths;
sum(len[ 1:(which( len == pos & rl$values==TRUE)[1]-1)],1)}
set.seed(1)
x = rnorm(1503501) > 0
n = 20
print(f1(x,n)==fC(x,n))
# [1] TRUE
print(f1(x,n)==tpos(x,n))
# [1] TRUE
benchmark(f1(x,n),fC(x,n),tpos(x,n),replications = 10)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 f1(x, n) 10 4.756 110.605 4.735 0.020 0 0
# 2 fC(x, n) 10 0.043 1.000 0.043 0.000 0 0
# 3 tpos(x, n) 10 2.591 60.256 2.376 0.205 0 0
Take a look at this transcript (using only a much smaller random sample). I think it fairly clear that it will be easy to write a function that picks out the first position that satisfies the joint condition and use cumsum on the lengths up to that point:
> x = rnorm(1500) > 0
> rle(x)
Run Length Encoding
lengths: int [1:751] 1 1 1 2 1 3 1 2 2 1 ...
values : logi [1:751] FALSE TRUE FALSE TRUE FALSE TRUE ...
> table( rle(x)$lengths )
1 2 3 4 5 6 7 8 9
368 193 94 46 33 10 2 4 1
> table( rle(x)$lengths , rle(x)$values)
FALSE TRUE
1 175 193
2 100 93
3 47 47
4 23 23
5 21 12
6 5 5
7 2 0
8 3 1
9 0 1
> which( rle(x)$lengths==8 & rle(x)$values==TRUE)
[1] 542
> which( rle(x)$lengths==7 & rle(x)$values==TRUE)
integer(0)
> which( rle(x)$lengths==6 & rle(x)$values==TRUE)
[1] 12 484 510 720 744
This is my candidate function:
tpos <- function(x,pos) { rl <- rle(x); len <- rl$lengths;
sum(len[ 1:(which( len == pos & rl$values==TRUE)[1]-1)],1)}
tpos(x,6)
#[1] 18
Note that I subtracted one from the first index so the length of the first qualifying run of TRUE's would not be added in, and then added one to that sum so that the position of the first such TRUE would be calculated. I'm guessing the positon of the first run of n-TRUEs will be distributed as one of the extreme value distributions (although it's not always monotonic increase)
> tpos(x,8)
[1] 1045
> tpos(x,8)
[1] 1045
> tpos(x,9)
[1] 1417
> tpos(x,10)
[1] 4806
> tpos(x,11)
[1] 2845
> tpos(x,12)
Error in 1:(which(len == pos & rl$values == TRUE)[1] - 1) :
NA/NaN argument
> set.seed(1)
> x = rnorm(30000) > 0
> tpos(x,12)
[1] 23509
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With