
Summing over all previous rows in large column efficiently

Tags: r

I have a large data set (>100,000 rows) and would like to create a new column that sums all previous values of another column.

For a simulated data set test.data with 100,000 rows and 2 columns, I create the new vector that sums the contents of column 2 with:

sapply(1:100000, function(x) sum(test.data[1:x[1],2]))

I append this vector to test.data later with cbind(). This is too slow, however. Is there a faster way to accomplish this, or a way to reference the vector that sapply is building from inside sapply, so that I can just update the cumulative sum instead of performing the whole calculation again?

J.Streb asked Oct 23 '25 15:10


1 Answer

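As I noted in my comment, it will be faster to do a direct assignment and use cumsum instead of sapply. The sapply version re-sums rows 1 through x for every x, whereas cumsum computes the running total in a single pass, which is exactly what you want here.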

This should work:

test.data$sum <- cumsum(test.data[, 2])
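To check that the two approaches agree, here is a minimal sketch on a small simulated data frame. The column names id and value and the 1,000-row size are assumptions made for illustration; they are not from the original question.

```r
# Simulated stand-in for test.data (hypothetical column names: id, value)
set.seed(1)
test.data <- data.frame(id = 1:1000, value = runif(1000))

# Original approach: re-sums rows 1..x for every x, so work grows quadratically
slow <- sapply(seq_len(nrow(test.data)), function(x) sum(test.data[1:x, 2]))

# cumsum makes a single pass over the column
test.data$sum <- cumsum(test.data[, 2])

# Compare with a tolerance rather than ==, since floating-point
# results may differ in the last few bits
same <- isTRUE(all.equal(slow, test.data$sum))
```

Wrapping each version in system.time() on the full 100,000 rows should make the difference in running time obvious.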

Mike H. answered Oct 25 '25 05:10


