Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to split an array into P subarrays of balanced sum

I have an big array of length N, let's say something like:

2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1

I need to split this array into P subarrays (in this example, P=4 would be reasonable), such that the sum of the elements in each subarray is as close as possible to sigma, being:

sigma=(sum of all elements in original array)/P

In this example, sigma=15.

For the sake of clarity, one possible result would be:

2 4 6    7 6 3 3   3 4 3 4    4 4 3 3 1
(sums: 12,19,14,15)

I have written a very naive algorithm based in how I would do the divisions by hand, but I don't know how to impose the condition that a division whose sums are (14,14,14,14,19) is worse than one that is (15,14,16,14,16).

Thank you in advance.

like image 826
Renoa Avatar asked Sep 06 '25 10:09

Renoa


2 Answers

First, let’s formalize your optimization problem by specifying the input, output, and the measure for each possible solution (I hope this is in your interest):

Given an array A of positive integers and a positive integer P, separate the array A into P non-overlapping subarrays such that the difference between the sum of each subarray and the perfect sum of the subarrays (sum(A)/P) is minimal.

Input: Array A of positive integers; P is a positive integer.
Output: Array SA of P non-negative integers representing the length of each subarray of A where the sum of these subarray lengths is equal to the length of A.
Measure: abs(sum(sa)-sum(A)/P) is minimal for each sa ∈ {sa | sa = (Ai, …, Ai+‍SAj) for i = (Σ SAj), j from 0 to P-1}.

The input and output define the set of valid solutions. The measure defines a measure to compare multiple valid solutions. And since we’re looking for a solution with the least difference to the perfect solution (minimization problem), measure should also be minimal.

With this information, it is quite easy to implement the measure function (here in Python):

def measure(a, sa):
    sigma = sum(a)/len(sa)
    diff = 0
    i = 0
    for j in xrange(0, len(sa)):
        diff += abs(sum(a[i:i+sa[j]])-sigma)
        i += sa[j]
    return diff

print measure([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], [3,4,4,5]) # prints 8

Now finding an optimal solution is a little harder.

We can use the Backtracking algorithm for finding valid solutions and use the measure function to rate them. We basically try all possible combinations of P non-negative integer numbers that sum up to length(A) to represent all possible valid solutions. Although this ensures not to miss a valid solution, it is basically a brute-force approach with the benefit that we can omit some branches that cannot be any better than our yet best solution. E.g. in the example above, we wouldn’t need to test solutions with [9,…] (measure > 38) if we already have a solution with measure ≤ 38.

Following the pseudocode pattern from Wikipedia, our bt function looks as follows:

def bt(c):
    global P, optimum, optimum_diff
    if reject(P,c):
        return
    if accept(P,c):
        print "%r with %d" % (c, measure(P,c))
        if measure(P,c) < optimum_diff:
            optimum = c
            optimum_diff = measure(P,c)
        return
    s = first(P,c)
    while s is not None:
        bt(list(s))
        s = next(P,s)

The global variables P, optimum, and optimum_diff represent the problem instance holding the values for A, P, and sigma, as well as the optimal solution and its measure:

class MinimalSumOfSubArraySumsProblem:
    def __init__(self, a, p):
        self.a = a
        self.p = p
        self.sigma = sum(a)/p

Next we specify the reject and accept functions that are quite straight forward:

def reject(P,c):
    return optimum_diff < measure(P,c)
def accept(P,c):
    return None not in c

This simply rejects any candidate whose measure is already more than our yet optimal solution. And we’re accepting any valid solution.

The measure function is also slightly changed due to the fact that c can now contain None values:

def measure(P, c):
    diff = 0
    i = 0
    for j in xrange(0, P.p):
        if c[j] is None:
            break;
        diff += abs(sum(P.a[i:i+c[j]])-P.sigma)
        i += c[j]
    return diff

The remaining two function first and next are a little more complicated:

def first(P,c):
    t = 0
    is_complete = True
    for i in xrange(0, len(c)):
        if c[i] is None:
            if i+1 < len(c):
                c[i] = 0
            else:
                c[i] = len(P.a) - t
            is_complete = False
            break;
        else:
            t += c[i]
    if is_complete:
        return None
    return c

def next(P,s):
    t = 0
    for i in xrange(0, len(s)):
        t += s[i]
        if i+1 >= len(s) or s[i+1] is None:
            if t+1 > len(P.a):
                return None
            else:
                s[i] += 1
            return s

Basically, first either replaces the next None value in the list with either 0 if it’s not the last value in the list or with the remainder to represent a valid solution (little optimization here) if it’s the last value in the list, or it return None if there is no None value in the list. next simply increments the rightmost integer by one or returns None if an increment would breach the total limit.

Now all you need is to create a problem instance, initialize the global variables and call bt with the root:

P = MinimalSumOfSubArraySumsProblem([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], 4)
optimum = None
optimum_diff = float("inf")
bt([None]*P.p)
like image 127
Gumbo Avatar answered Sep 10 '25 22:09

Gumbo


@Gumbo 's answer is clear and actionable, but consumes lots of time when length(A) bigger than 400 and P bigger than 8. This is because that algorithm is kind of brute-forcing with benefits as he said.

In fact, a very fast solution is using dynamic programming.

Given an array A of positive integers and a positive integer P, separate the array A into P non-overlapping subarrays such that the difference between the sum of each subarray and the perfect sum of the subarrays (sum(A)/P) is minimal.

Measure: , where is sum of elements of subarray , is the average of P subarray' sums.

This can make sure the balance of sum, because it use the definition of Standard Deviation.

Persuming that array A has N elements; Q(i,j) means the minimum Measure value when split the last i elements of A into j subarrays. D(i,j) means (sum(B)-sum(A)/P)^2 when array B consists of the i~jth elements of A ( 0<=i<=j<N ).

The minimum measure of the question is to calculate Q(N,P). And we find that:

Q(N,P)=MIN{Q(N-1,P-1)+D(0,0); Q(N-2,P-1)+D(0,1); ...; Q(N-1,P-1)+D(0,N-P)}

So it like can be solved by dynamic programming.

 Q(i,1) = D(N-i,N-1)

 Q(i,j) = MIN{ Q(i-1,j-1)+D(N-i,N-i); 
               Q(i-2,j-1)+D(N-i,N-i+1); 
               ...; 
               Q(j-1,j-1)+D(N-i,N-j)}

So the algorithm step is:

 1. Cal j=1:

    Q(1,1), Q(2,1)... Q(3,1)

 2. Cal j=2:

    Q(2,2) = MIN{Q(1,1)+D(N-2,N-2)};

    Q(3,2) = MIN{Q(2,1)+D(N-3,N-3); Q(1,1)+D(N-3,N-2)}

    Q(4,2) = MIN{Q(3,1)+D(N-4,N-4); Q(2,1)+D(N-4,N-3); Q(1,1)+D(N-4,N-2)}

 ... Cal j=...

 P. Cal j=P:

    Q(P,P), Q(P+1,P)...Q(N,P)

The final minimum Measure value is stored as Q(N,P)! 
To trace each subarray's length, you can store the 
MIN choice when calculate Q(i,j)=MIN{Q+D...}

space for D(i,j);

time for calculate Q(N,P)

compared to the pure brute-forcing algorithm consumes time.

like image 23
Harold Gao Avatar answered Sep 10 '25 23:09

Harold Gao