
Proportionately split dataframe with multiple target columns

I have a dataframe with 30 rows and 10 columns. 5 of the columns are input features and the other 5 are output/target columns. The target columns contain classes represented as 0, 1, 2. I want to split the dataset into train and test such that, in the train set, for each output column, the proportion of class 1 is between 0.15 and 0.3. (I am not bothered about the distribution of classes in the test set).

ADDITIONAL CONTEXT: I am trying to balance the output classes in a multi-class, multi-output dataset. My understanding is that this is an optimization problem with 25 (?) degrees of freedom. Given any input dataset, I would like to create a subset of it that serves as my training data and has the desired class balance (i.e. class 1 between 0.15 and 0.3 for each output column).

I create the dataframe like this:

import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split

np.random.seed(42)
data = pd.DataFrame({
    'A': np.random.rand(30),
    'B': np.random.rand(30),
    'C': np.random.rand(30),
    'D': np.random.rand(30),
    'E': np.random.rand(30),
    'F': np.random.choice([0, 1, 2], 30),
    'G': np.random.choice([0, 1, 2], 30),
    'H': np.random.choice([0, 1, 2], 30),
    'I': np.random.choice([0, 1, 2], 30),
    'J': np.random.choice([0, 1, 2], 30)
})

My current silly/harebrained solution for this problem uses two separate functions: a helper that checks whether the proportion of class 1 in each column is within my desired range, and a main function that retries random splits until that check passes.

def check_proportions(df, cols, min_prop = 0.15, max_prop = 0.3, class_category = 1):
    for col in cols:
        prop = (df[col] == class_category).mean()
        if not (min_prop <= prop <= max_prop):
            return False
    return True


def proportionately_split_data(data, target_cols, min_prop = 0.15, max_prop = 0.3):
    while True:
        random_state = np.random.randint(100_000)
        train_df, test_df = train_test_split(data, test_size = 0.3, random_state = random_state)
        if check_proportions(train_df, target_cols, min_prop, max_prop):
            return train_df, test_df

Finally, I run the code using

target_cols = ["F", "G", "H", "I", "J"]

train, test = proportionately_split_data(data, target_cols)

My worry with this current "solution" is that it is probabilistic, not deterministic. I can see proportionately_split_data getting stuck in an infinite loop if none of the random states I pass to train_test_split happens to produce a split with the desired proportions. Any help would be much appreciated!

I apologize for not providing this earlier. For a minimal working example, the input (data) could be

A B C D E OUTPUT_1 OUTPUT_2 OUTPUT_3 OUTPUT_4 OUTPUT_5
5.65 3.56 0.94 9.23 6.43 0 1 1 0 1
7.43 3.95 1.24 7.22 2.66 0 0 0 1 2
9.31 2.42 2.91 2.64 6.28 2 1 2 2 0
8.19 5.12 1.32 3.12 8.41 1 2 0 1 2
9.35 1.92 3.12 4.13 3.14 0 1 1 0 1
8.43 9.72 7.23 8.29 9.18 1 0 0 2 2
4.32 2.12 3.84 9.42 8.19 0 0 0 0 0
3.92 3.91 2.90 8.19 8.41 2 2 2 2 1
7.89 1.92 4.12 8.19 7.28 1 1 2 0 2
5.21 2.42 3.10 0.31 1.31 2 0 1 1 0

which has 10 rows and 10 columns,

and an expected output (train set) could be

A B C D E OUTPUT_1 OUTPUT_2 OUTPUT_3 OUTPUT_4 OUTPUT_5
5.65 3.56 0.94 9.23 6.43 0 1 1 0 1
7.43 3.95 1.24 7.22 2.66 0 0 0 1 2
9.31 2.42 2.91 2.64 6.28 2 1 2 2 0
8.19 5.12 1.32 3.12 8.41 1 2 0 1 2
8.43 9.72 7.23 8.29 9.18 1 0 0 2 2
3.92 3.91 2.90 8.19 8.41 2 2 2 2 1
5.21 2.42 3.10 0.31 1.31 2 0 1 1 0

Here each output column in the train set has at least 2 instances of class 1 (>= 0.15 * number of rows in the input data) and at most 3 (<= 0.3 * number of rows in the input data). I should also have clarified that the proportion is relative to the number of examples (rows) in the input dataset. My test set would be the remaining rows of the input dataset.

Caesar asked Feb 04 '26 15:02


1 Answer

I've formulated your problem as a linear programming problem (LPP) with integer (0/1) variables and solved it with scipy.optimize.linprog. The optimisation succeeded on both of the examples that you gave.

Your Example dataset

import numpy as np
import pandas as pd

np.random.seed(42)

data = pd.DataFrame({
    'A': np.random.rand(30),
    'B': np.random.rand(30),
    'C': np.random.rand(30),
    'D': np.random.rand(30),
    'E': np.random.rand(30),
    'OUTPUT_1': np.random.choice([0, 1, 2], 30),
    'OUTPUT_2': np.random.choice([0, 1, 2], 30),
    'OUTPUT_3': np.random.choice([0, 1, 2], 30),
    'OUTPUT_4': np.random.choice([0, 1, 2], 30),
    'OUTPUT_5': np.random.choice([0, 1, 2], 30)
})

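Solution, which should be orders of magnitude faster than your original retry loop and reports failure instead of looping forever when no valid split exists. I've left the configurable values that you can play around with at the top. The integrality argument needs SciPy 1.9 or newer.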

import math
from scipy.optimize import linprog

OUT_COLS = ["OUTPUT_1", "OUTPUT_2", "OUTPUT_3", "OUTPUT_4", "OUTPUT_5"]
LB_MULT = 0.15  # lower bound multiplier
UB_MULT = 0.3  # upper bound multiplier
TARGET_CLASS = 1  # the target class in OUT_COLS that you wish to redistribute.
TRAIN_MULT = 0.7

total_len = len(data)
train_size = int(TRAIN_MULT * total_len)
min_class_1 = math.ceil(LB_MULT * total_len)
max_class_1 = int(UB_MULT * total_len)

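for col in OUT_COLS:
    # count directly: value_counts().loc[...] raises a KeyError if the class never occurs
    if (data[col] == TARGET_CLASS).sum() < min_class_1:
        print(f"solution infeasible due to insufficient class {TARGET_CLASS}s in col {col}")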

# set to zero - feasibility over optimality
c = np.zeros(total_len)

# equals constraint
# sum of all rows must be train size (row = 0 if not in train, 1 otherwise)
A_eq, b_eq = np.ones((1, total_len)), [train_size]

# upper bound constraint
A_ub, b_ub = [], []

# each column must satisfy lb <= class 1 count <= ub
for col in OUT_COLS:
    class_1_indicator = (data[col] == TARGET_CLASS).astype(int)
    A_ub.append(-class_1_indicator)
    b_ub.append(-min_class_1)  # -x <= -lb equivalent to x >= lb
    A_ub.append(class_1_indicator)
    b_ub.append(max_class_1)  # x <= ub

A_ub, b_ub = np.vstack(A_ub), np.array(b_ub)

# binary - 1 in train, 0 not
x_bounds = [(0, 1) for _ in range(total_len)]

# LPP
result = linprog(
    c,
    A_ub=A_ub,
    b_ub=b_ub,
    A_eq=A_eq,
    b_eq=b_eq,
    bounds=x_bounds,
    # 1 marks a variable as integer-valued, so the LPP only has integer solutions.
    integrality=np.ones(total_len),
)

print(result)
if result.success:
    # round before comparing: the solver returns floats that can be off by a tiny tolerance
    in_train = np.round(result.x).astype(bool)
    train_data = data[in_train]
    test_data = data[~in_train]
    print(train_data)
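
As a rough, self-contained sketch of the same idea, here it is wrapped in a function and run on the 10-row example from the question, followed by a check of the class 1 counts per output column. The function name split_with_class_bounds and its parameter names are made up for illustration; they are not part of SciPy or of the code above. The bounds are computed from the total number of rows, as in the question's clarification.

```python
import math

import numpy as np
import pandas as pd
from scipy.optimize import linprog

OUT_COLS = ["OUTPUT_1", "OUTPUT_2", "OUTPUT_3", "OUTPUT_4", "OUTPUT_5"]


def split_with_class_bounds(df, out_cols, target_class=1,
                            lb_mult=0.15, ub_mult=0.3, train_mult=0.7):
    """Hypothetical helper: return (train, test) or raise if no split exists."""
    n = len(df)
    train_size = int(train_mult * n)
    lo = math.ceil(lb_mult * n)  # minimum target-class count per column
    hi = int(ub_mult * n)        # maximum target-class count per column

    # One indicator row per output column: 1.0 where the row holds the target class.
    ind = np.array([(df[c] == target_class).to_numpy(dtype=float) for c in out_cols])

    # count >= lo is written as -count <= -lo; count <= hi is used as-is.
    A_ub = np.vstack([-ind, ind])
    b_ub = np.concatenate([np.full(len(out_cols), -lo), np.full(len(out_cols), hi)])

    result = linprog(
        np.zeros(n),                            # zero objective: any feasible point will do
        A_ub=A_ub, b_ub=b_ub,
        A_eq=np.ones((1, n)), b_eq=[train_size],
        bounds=[(0, 1)] * n,
        integrality=np.ones(n),                 # needs SciPy >= 1.9
    )
    if not result.success:
        raise ValueError(f"no valid split found: {result.message}")

    in_train = np.round(result.x).astype(bool)  # guard against values like 0.9999999
    return df[in_train], df[~in_train]


# The 10-row example from the question.
rows = [
    [5.65, 3.56, 0.94, 9.23, 6.43, 0, 1, 1, 0, 1],
    [7.43, 3.95, 1.24, 7.22, 2.66, 0, 0, 0, 1, 2],
    [9.31, 2.42, 2.91, 2.64, 6.28, 2, 1, 2, 2, 0],
    [8.19, 5.12, 1.32, 3.12, 8.41, 1, 2, 0, 1, 2],
    [9.35, 1.92, 3.12, 4.13, 3.14, 0, 1, 1, 0, 1],
    [8.43, 9.72, 7.23, 8.29, 9.18, 1, 0, 0, 2, 2],
    [4.32, 2.12, 3.84, 9.42, 8.19, 0, 0, 0, 0, 0],
    [3.92, 3.91, 2.90, 8.19, 8.41, 2, 2, 2, 2, 1],
    [7.89, 1.92, 4.12, 8.19, 7.28, 1, 1, 2, 0, 2],
    [5.21, 2.42, 3.10, 0.31, 1.31, 2, 0, 1, 1, 0],
]
data = pd.DataFrame(rows, columns=list("ABCDE") + OUT_COLS)

train, test = split_with_class_bounds(data, OUT_COLS)

# Each output column of the train set should hold 2 or 3 rows of class 1.
class_1_counts = (train[OUT_COLS] == 1).sum()
print(class_1_counts)
```

The rows selected need not match the expected output in the question, since several subsets satisfy the constraints and the solver returns whichever feasible one it finds first.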

How does it work?

The explanation will assume familiarity with LPPs. The picture is from the scipy.optimize.linprog page.

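[Image: the standard problem form from the scipy.optimize.linprog documentation: minimize c^T x subject to A_ub x <= b_ub, A_eq x = b_eq, l <= x <= u]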

  • c = 0 -> tells the optimizer that we do not care about optimizing the objective function; we only want a solution that satisfies the constraints.
  • x -> each value in the x vector of the objective function corresponds to a row in the table. We also require each value in x to be an integer.
  • A_eq constraint -> the number of selected rows must equal the train size.
  • A_ub constraint -> each output column must satisfy the lower and upper bound on the number of class 1 occurrences.
  • l and u -> x must lie between 0 and 1. Since it must also be an integer, it can only be 0 or 1, corresponding to exclusion from or inclusion in the train set.
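
Written out for this specific problem, the program that the code builds looks roughly like the following. The symbols are my own shorthand rather than anything from SciPy: n is total_len, y_{ij} is the class in row i of output column j, and n_train, k_min and k_max stand for train_size, min_class_1 and max_class_1.

```latex
\begin{aligned}
\text{find}\quad & x \in \{0, 1\}^{n} \\
\text{such that}\quad & \sum_{i=1}^{n} x_i = n_{\text{train}}, \\
& k_{\min} \;\le\; \sum_{i=1}^{n} \mathbb{1}[y_{ij} = 1]\, x_i \;\le\; k_{\max}
  \quad \text{for each output column } j,
\end{aligned}
\qquad
n_{\text{train}} = \lfloor 0.7\, n \rfloor,\quad
k_{\min} = \lceil 0.15\, n \rceil,\quad
k_{\max} = \lfloor 0.3\, n \rfloor .
```

Each lower bound is passed to linprog in negated form, since A_ub only expresses "less than or equal" constraints.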

Hope that helped!

dydev answered Feb 06 '26 05:02


