More Efficient Way To Do A Conditional Running Total In R

Question

As this is my first time asking a question on SO, I apologize in advance for any improper formatting.

I am very new to R and am trying to create a function that will return the row value of a data frame column once a running total in another column has met or exceeded a given value (the row that the running sum begins in is also an argument).

For example, given the following data frame, if given a starting parameter of x=3 and stop parameter of y=17, the function should return 5 (the x value of the row where the sum of y >= 17).

The function as I've currently written it returns the correct answer, but I have to believe there is a much more 'R-ish' way to accomplish this, instead of using loops and incrementing temporary variables, and would like to learn the right way, rather than form bad habits that I will have to correct later.

A very simplified version of the function:

myFunction<-function(DataFrame,StartRow,Total){
    df<-DataFrame[DataFrame[[1]] >= StartRow,]
    i<-0
    j<-0

    while (j < Total) {
        i<-i+1
        j<-sum(df[[2]][1:i])
    }

    x<-df[[1]][i]
    return(x)
}

josliber · Accepted Answer

All the solutions posted so far compute the cumulative sum of the entire Y variable, which can be inefficient in cases where the data frame is really large but the index is near the beginning. In this case, a solution with Rcpp could be more efficient:

library(Rcpp)
get_min_cum2 = cppFunction("
int gmc2(NumericVector X, NumericVector Y, int start, int total) {
    double running = 0.0;
    for (int idx=0; idx < Y.size(); ++idx) {
        if (X[idx] >= start) {
            running += Y[idx];
            if (running >= total) {
                return X[idx];
            }
        }
    }
    return -1;  // Running total never exceeds limit
}")

Comparison with microbenchmark:

get_min_cum <- 
 function(start,total) 
   with(dat[dat$X>=start,],X[min(which(cumsum(Y)>total))])
get_min_dt <- function(start, total)
   dt[X >= start, X[cumsum(Y) >= total][1]]

set.seed(144)
dat = data.frame(X=1:1000000, Y=abs(rnorm(1000000)))
dt = data.table(dat)
get_min_cum(3, 17)
# [1] 29
get_min_dt(3, 17)
# [1] 29
get_min_cum2(dat$X, dat$Y, 3, 17)
# [1] 29

library(microbenchmark)
microbenchmark(get_min_cum(3, 17), get_min_dt(3, 17),
               get_min_cum2(dat$X, dat$Y, 3, 17))
# Unit: milliseconds
#                               expr        min         lq    median         uq      max neval
#                 get_min_cum(3, 17) 125.324976 170.052885 180.72279 193.986953 418.9554   100
#                  get_min_dt(3, 17) 100.990098 149.593250 162.24523 176.661079 399.7531   100
#  get_min_cum2(dat$X, dat$Y, 3, 17)   1.157059   1.646184   2.30323   4.628371 256.2487   100

In this case, it's about 100x faster to use the Rcpp solution than other approaches.

More Efficient Way To Do A Conditional Running Total In R

Tags:

loops

r

cumulative-sum

user3351605

1 Answers

josliber

Recent Activity

Donate For Us

More Efficient Way To Do A Conditional Running Total In R

Tags:

loops

r

cumulative-sum

user3351605

1 Answers

josliber

Related questions

Recent Activity

Donate For Us