Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create new column that cumulatively sums to cumulative sum of existing column?

Using tidyverse functions, I'm looking to create a new column of data that sums up to the cumulative sum in the first column, but uses increments no greater than incr.

That could start with df

incr <- 1.5
df <- data.frame(a = c(6,0,0,2.5,0,0))
df
    a
1 6.0
2 0.0
3 0.0
4 2.5
5 0.0
6 0.0

then create a new column b using incr:

    a   b
1 6.0 1.5
2 0.0 1.5
3 0.0 1.5
4 2.5 1.5
5 0.0 1.5
6 0.0 1.0

I found tidyr::uncount() and it seemed like a function that could work, but I need to use non-integer increments. Generally trying to mutate and use vectorized functions and have put some thought towards running it rowwise(), but most of my ideas require iteration.

From comment below: If analogies are helpful, think of it like queued downloads. Column a shows you press to download 6 MB at time 1, 0 MB for times 2 and 3, then you press to download 2.5 MB at time 4. Your connection, however, can only download at the speed of incr. So, if incr is 1.5, column b shows what actually downloaded. You fully use that connection speed each period until you download the final residual (1.0) in time 6.

To better highlight the dimensionality, here's another df:

incr <- 1.5
df <- data.frame(a = rep(0,100),b=rep(0,100))
df$a[c(30,33,38)] = c(6,2.5,1)
df[30:39,]
     a b
30 6.0 0
31 0.0 0
32 0.0 0
33 2.5 0
34 0.0 0
35 0.0 0
36 0.0 0
37 0.0 0
38 1.0 0
39 0.0 0

with a desired output

     a   b
30 6.0 1.5
31 0.0 1.5
32 0.0 1.5
33 2.5 1.5
34 0.0 1.5
35 0.0 1.0
36 0.0 0.0
37 0.0 0.0
38 1.0 1.0
39 0.0 0.0
like image 508
Joseff Kolman Avatar asked Nov 05 '25 08:11

Joseff Kolman


1 Answers

because I don't know when to give up, I think the only solution is iteration. But I think you can always do iteration in C++ to make it as super-speedy as a vectorised function:

Edited as more efficient function (that doesn't keep calculating fresh column sums)

Rcpp::cppFunction(
  "
    NumericVector iterate_to_cumsum(NumericVector v1, double incr) {
    int x = v1.size();
    NumericVector v2(x);
  
    double sumv1 = 0;
    double sumv2 = 0;
  
    for (int i = 0; i < x; ++i) {
      sumv1 += v1[i];
      v2[i] = std::min(incr, sumv1 - sumv2);
      sumv2 += v2[i];
    }

    return v2;

  }
"
)

library(tidyverse)

df <- data.frame(a = c(6, 0, 0, 2.5, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0))
incr <- 1.5

df |> 
  mutate(b = iterate_to_cumsum(a, incr))
#>      a   b
#> 1  6.0 1.5
#> 2  0.0 1.5
#> 3  0.0 1.5
#> 4  2.5 1.5
#> 5  0.0 1.5
#> 6  0.0 1.0
#> 7  0.0 0.0
#> 8  0.0 0.0
#> 9  1.0 1.0
#> 10 0.0 0.0
#> 11 3.0 1.5
#> 12 0.0 1.5
#> 13 0.0 0.0
#> 14 0.0 0.0

An actually working Rcpp function does indeed speed things up a little (running tests as in @LMc's post):

# @Andy Baxter
cpp <- function(){
  df |> 
    mutate(b = iterate_to_cumsum(a, incr))
}

# @LMc
fun_factory <- function(){
  df |>
    mutate(b = map_dbl(a, incr_cap(1.5)))
}

bench::mark(
  cpp(),
  fun_factory()
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cpp()           1.53ms   1.77ms    474.      2.15MB     8.00
#> 2 fun_factory() 116.79ms 124.31ms      8.15  786.48KB    19.6
like image 55
Andy Baxter Avatar answered Nov 07 '25 09:11

Andy Baxter



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!