Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generate random array of integers with known total

Consider the following toy array with integers ranging from 5 - 25:

a = np.array([12, 18, 21])

How can I generate arrays of 5 random integers ranging from 1 - 5, the sum of which equals each number in array a? The solution should generate a uniform distribution over all possible outputs.

So far, I have managed to create a simple function, which will produce 5 random integers:

import numpy as np

def foo(a, b):
    p = np.ones(b) / b
    return np.random.multinomial(a, p, size = 1)

Example using the values from array a*:

In [1]: foo(12, 5)
Out[1]: array([[1, 4, 3, 2, 2]])

In [2]: foo(18, 5)
Out[2]: array([[2, 2, 3, 3, 8]])

In [3]: foo(21, 5)
Out[3]: array([[6, 5, 3, 4, 3]])

Clearly, these integers have the required total but they are not always bounded between 1 and 5.

Expected output would be something along the following lines:

In [4]: foo(np.array([12, 18, 21]), 5)
Out[4]: 
array([[1, 4, 3, 2, 2],
       [4, 3, 3, 3, 5],
       [5, 5, 4, 4, 3]])

*the np.multinomial()) function only takes integers as an argument.


2 Answers

Here is an exact (every legal sum has same probability) solution. It uses an enumeration of all legal sums, not in the sense that we go through each and every sum, but rather given a number n we can directly compute the n-th sum in the enumeration. As we also know the total number of legal sums, we can simply draw uniform integers and transform them:

import numpy as np
import functools as ft

#partition count
@ft.lru_cache(None)
def capped_pc(N,k,m):
    if N < 0:
        return 0
    if k == 0:
        return int(N<=m)
    return sum(capped_pc(N-i,k-1,m) for i in range(m+1))

capped_pc_v = np.vectorize(capped_pc)

def random_capped_partition(low,high,n,total,size=1):
    total = total - n*low
    high = high - low
    if total > n*high or total < 0:
        raise ValueError
    idx = np.random.randint(0,capped_pc(total,n-1,high),size)
    total = np.broadcast_to(total,(size,1))
    out = np.empty((size,n),int)
    for j in range(n-1):
        freqs = capped_pc_v(total-np.arange(high+1),n-2-j,high)
        freqs_ps = np.c_[np.zeros(size,int),freqs.cumsum(axis=1)]
        out[:,j] = [f.searchsorted(i,"right") for f,i in zip(freqs_ps[:,1:],idx)]
        idx = idx - np.take_along_axis(freqs_ps,out[:,j,None],1).ravel()
        total = total - out[:,j,None]
    out[:,-1] = total.ravel()
    return out + low

Demo:

# 4 values between 1 and 5 summing to 12
# a million samples takes a few seconds
x = random_capped_partition(1,5,4,12,1000000)

# sanity checks

# values between 1 and 5
x.min(),x.max()
# (1, 5)

# all legal sums occur
# count them brute force
sum(1 for a in range(1,6) for b in range(1,6) for c in range(1,6) if 7 <=     a+b+c <= 11)
# 85
# and count unique samples
len(np.unique(x,axis=0))
# 85

# check for uniformity
np.unique(x, axis=0, return_counts=True)[1]
# array([11884, 11858, 11659, 11544, 11776, 11625, 11813, 11784, 11733,
#        11699, 11820, 11802, 11844, 11807, 11928, 11641, 11701, 12084,
#        11691, 11793, 11857, 11608, 11895, 11839, 11708, 11785, 11764,
#        11736, 11670, 11804, 11821, 11818, 11798, 11587, 11837, 11759,
#        11707, 11759, 11761, 11755, 11663, 11747, 11729, 11758, 11699,
#        11672, 11630, 11789, 11646, 11850, 11670, 11607, 11860, 11772,
#        11716, 11995, 11802, 11865, 11855, 11622, 11679, 11757, 11831,
#        11737, 11629, 11714, 11874, 11793, 11907, 11887, 11568, 11741,
#        11932, 11639, 11716, 12070, 11746, 11787, 11672, 11643, 11798,
#        11709, 11866, 11851, 11753])

Brief explanation:

We use a simple recurrence to compute the total number of capped partitions. We split on the first bin, i.e we fix the number in the first bin and by recurrence retrieve the number of ways to fill the remaining bins. Then we simply sum over the different first bin options. We use a cache decorator to keep the recursion under control. This decorator remembers all parameter combinations we've already done, so when we arrive at the same parameters via different recursion paths we don't have to do the same computation again.

The enumeration works similarly. Assume lexicographic order. How to find the n-th partition? Again, split on the first bin. Because we know for each value the first bin can take the total number of ways the remaining bins can be populated we can form the accumulated sum and then see where n fits in. If n lies between the i-th and i+1st partial sum the the first index is i+low, we subtract the i-th sum from n and start over on the remaining bins.

like image 87
Paul Panzer Avatar answered Nov 17 '25 22:11

Paul Panzer


The following differs from my other answer in that it guarantees uniform distribution over all possible outputs, but is probably only practical for small inputs (where the space of possible outputs is fairly small):

import random

def gen_all(low, high, n, total):
  """Yields all possible n-tuples of [low; high] ints that sum up to total."""
  if n == 0:
    yield ()
    return
  adj_low = max(low, total - (n - 1) * high)
  adj_high = min(high, total - (n - 1) * low)
  for val in range(adj_low, adj_high + 1):
    for arr in gen_all(low, high, n - 1, total - val):
      yield (val,) + arr

def gen(low, high, n, total):
  return random.choice(list(gen_all(low, high, n, total)))

print(gen(low=1, high=5, n=5, total=5))
print(gen(low=1, high=5, n=5, total=12))
print(gen(low=1, high=5, n=5, total=18))
print(gen(low=1, high=5, n=5, total=25))

This materialises all possible outputs into a list before picking one. In practice, one would probably use reservoir sampling (of size 1).

If you need to generate multiple random sets with the same parameters, reservoir sampling with replacement might be the answer: https://epubs.siam.org/doi/pdf/10.1137/1.9781611972740.53

like image 20
NPE Avatar answered Nov 17 '25 20:11

NPE