Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract every combination of numeric values once they reach a target sum

I have a dataframe of two columns like so:

Name Salary
Name 1 5000000
Name 2 7000000
Name 3 9000000
Name 4 12000000
Name 5 14000000

I want to find every combination of salaries that reach 20,000,000. It can go over 20,000,000, but once it does, it stops. For example, names 1, 2 and 3 would be a valid combination, as would names 4 and 5.

I would also love to be able to view the salary combinations with their associates names in a dataframe.

Note that my dataframes typically have about 12 to 15 names across 30 different dataframes.

So far, I've tried to wrap my head around combn(), but I could only choose a specific number "m" as far as I know and could not choose a target number when the combinations stop. I also couldn't get anywhere with the "sets" package.

like image 992
ColinM Avatar asked Oct 15 '25 19:10

ColinM


2 Answers

Here is one way that enumerates all the combinations and selects the ones that are feasible. Note that this is very inefficient as since once can write a function that is short circuited. ie if I know that all 3 combinations are greater than the target, then there is no need to do 4,5,6,etc combinations. Also If I know that a subset is greater than a target, there is no need to do any combination containing that particular subset. I ignored all these thus why I said the method is inefficient:

g <- function(df){
    n <- seq_len(nrow(df))
    df <- df[order(-df$Salary),]
    fn <- \(x)if(sum(cumsum(df$Salary[x])>20e6) == 1)df$Name[x]
    gn <- \(x)Filter(length, combn(n, x, fn, simplify = FALSE))
    unlist(sapply(n, gn), FALSE)
}
g(df)

[[1]]
[1] "Name 5" "Name 4"

[[2]]
[1] "Name 5" "Name 3"

[[3]]
[1] "Name 5" "Name 2"

[[4]]
[1] "Name 4" "Name 3"

[[5]]
[1] "Name 4" "Name 2" "Name 1"

[[6]]
[1] "Name 3" "Name 2" "Name 1"
like image 109
KU99 Avatar answered Oct 18 '25 08:10

KU99


You can try the following base R code, where the search stops whenever the combination sum reaches the given given target tg.

The idea is that: We can imagine that the search is along a tree, and transverse stops when the condition is met, where the accumulated visited nodes along the path is treated as a new parent to check the incoming child nodes.

Code

f <- function(df, tg) {
  v <- sort(with(df, setNames(Salary, Name)), TRUE)
  root <- Map(setNames, v, names(v))
  out <- c()
  repeat {
    leaf <- c()
    for (r in root) {
      candidates <- v[v < tail(r, 1)]
      if (length(candidates)) {
        for (k in seq_along(candidates)) {
          p <- c(r, candidates[k])
          if (sum(p) >= tg) {
            out <- c(out, list(p))
          } else {
            leaf <- c(leaf, list(p))
          }
        }
      }
    }
    if (length(root) == length(leaf)) {
      return(out)
    }
    root <- leaf
  }
}

Output

> f(df, 20e6)
[[1]]
   Name5    Name4
14000000 12000000

[[2]]
   Name5    Name3
14000000  9000000

[[3]]
   Name5    Name2 
14000000  7000000

[[4]]
   Name4    Name3
12000000  9000000

[[5]]
   Name4    Name2    Name1
12000000  7000000  5000000

[[6]]
  Name3   Name2   Name1
9000000 7000000 5000000

Benchmark

with the given df, we compare two approaches

f <- function(df, tg) {
  v <- sort(with(df, setNames(Salary, Name)), TRUE)
  root <- Map(setNames, v, names(v))
  out <- c()
  repeat {
    leaf <- c()
    for (r in root) {
      candidates <- v[v < tail(r, 1)]
      if (length(candidates)) {
        for (k in seq_along(candidates)) {
          p <- c(r, candidates[k])
          if (sum(p) >= tg) {
            out <- c(out, list(p))
          } else {
            leaf <- c(leaf, list(p))
          }
        }
      }
    }
    if (length(root) == length(leaf)) {
      return(out)
    }
    root <- leaf
  }
}

g <- function(df, tg) {
  n <- seq_len(nrow(df))
  df <- df[order(-df$Salary), ]
  fn <- \(x)if (sum(cumsum(df$Salary[x]) >= tg) == 1) df$Name[x]
  gn <- \(x)Filter(length, combn(n, x, fn, simplify = FALSE))
  unlist(sapply(n, gn), FALSE)
}

microbenchmark(
  TIC = f(df, 20e6),
  Onyambu = g(df, 20e6),
  unit = "relative"
)

we can see (but we should be aware that the speed also depends on the target value, so the benchmarking below is just for reference regarding OP's specific example data in the post, should be tested in more general cases)

Unit: relative
    expr      min      lq     mean   median       uq       max neval
     TIC 1.000000 1.00000 1.000000 1.000000 1.000000 1.0000000   100
 Onyambu 1.627361 1.62317 1.240159 1.645295 1.625882 0.6493649   100
like image 30
ThomasIsCoding Avatar answered Oct 18 '25 10:10

ThomasIsCoding



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!