I have a dataframe with 30 rows and 10 columns. 5 of the columns are input features and the other 5 are output/target columns. The target columns contain classes represented as 0, 1, 2. I want to split the dataset into train and test such that, in the train set, for each output column, the proportion of class 1 is between 0.15 and 0.3. (I am not bothered about the distribution of classes in the test set).
ADDITIONAL CONTEXT: I am trying to balance the output classes in a multi-class, multi-output dataset. My understanding is that this would be an optimization problem with 25 (?) degrees of freedom, so that, given any input dataset, I can create a subset of it to use as training data with the desired class balance (i.e. class 1 between 0.15 and 0.3 for each output column).
I create the dataframe with:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
np.random.seed(42)
data = pd.DataFrame({
    'A': np.random.rand(30),
    'B': np.random.rand(30),
    'C': np.random.rand(30),
    'D': np.random.rand(30),
    'E': np.random.rand(30),
    'F': np.random.choice([0, 1, 2], 30),
    'G': np.random.choice([0, 1, 2], 30),
    'H': np.random.choice([0, 1, 2], 30),
    'I': np.random.choice([0, 1, 2], 30),
    'J': np.random.choice([0, 1, 2], 30)
})
My current silly/harebrained solution for this problem uses two separate functions: a helper that checks whether the proportion of class 1 in each column is within my desired range, and a splitter that keeps re-drawing random splits until the helper passes.
def check_proportions(df, cols, min_prop = 0.15, max_prop = 0.3, class_category = 1):
    for col in cols:
        prop = (df[col] == class_category).mean()
        if not (min_prop <= prop <= max_prop):
            return False
    return True
def proportionately_split_data(data, target_cols, min_prop = 0.15, max_prop = 0.3):
    while True:
        random_state = np.random.randint(100_000)
        train_df, test_df = train_test_split(data, test_size = 0.3, random_state = random_state)
        if check_proportions(train_df, target_cols, min_prop, max_prop):
            return train_df, test_df
Finally, I run the code using
target_cols = ["F", "G", "H", "I", "J"]
train, test = proportionately_split_data(data, target_cols)
My worry with this current "solution" is that it is probabilistic rather than deterministic. I can see proportionately_split_data getting stuck in an infinite loop if none of the random states passed to train_test_split happens to produce a split with the desired proportions. Any help would be much appreciated!
I apologize for not providing this earlier. For a minimal working example, the input (data) could be
| A | B | C | D | E | OUTPUT_1 | OUTPUT_2 | OUTPUT_3 | OUTPUT_4 | OUTPUT_5 |
|---|---|---|---|---|---|---|---|---|---|
| 5.65 | 3.56 | 0.94 | 9.23 | 6.43 | 0 | 1 | 1 | 0 | 1 |
| 7.43 | 3.95 | 1.24 | 7.22 | 2.66 | 0 | 0 | 0 | 1 | 2 |
| 9.31 | 2.42 | 2.91 | 2.64 | 6.28 | 2 | 1 | 2 | 2 | 0 |
| 8.19 | 5.12 | 1.32 | 3.12 | 8.41 | 1 | 2 | 0 | 1 | 2 |
| 9.35 | 1.92 | 3.12 | 4.13 | 3.14 | 0 | 1 | 1 | 0 | 1 |
| 8.43 | 9.72 | 7.23 | 8.29 | 9.18 | 1 | 0 | 0 | 2 | 2 |
| 4.32 | 2.12 | 3.84 | 9.42 | 8.19 | 0 | 0 | 0 | 0 | 0 |
| 3.92 | 3.91 | 2.90 | 8.19 | 8.41 | 2 | 2 | 2 | 2 | 1 |
| 7.89 | 1.92 | 4.12 | 8.19 | 7.28 | 1 | 1 | 2 | 0 | 2 |
| 5.21 | 2.42 | 3.10 | 0.31 | 1.31 | 2 | 0 | 1 | 1 | 0 |
which has 10 rows and 10 columns,
and an expected output (train set) could be
| A | B | C | D | E | OUTPUT_1 | OUTPUT_2 | OUTPUT_3 | OUTPUT_4 | OUTPUT_5 |
|---|---|---|---|---|---|---|---|---|---|
| 5.65 | 3.56 | 0.94 | 9.23 | 6.43 | 0 | 1 | 1 | 0 | 1 |
| 7.43 | 3.95 | 1.24 | 7.22 | 2.66 | 0 | 0 | 0 | 1 | 2 |
| 9.31 | 2.42 | 2.91 | 2.64 | 6.28 | 2 | 1 | 2 | 2 | 0 |
| 8.19 | 5.12 | 1.32 | 3.12 | 8.41 | 1 | 2 | 0 | 1 | 2 |
| 8.43 | 9.72 | 7.23 | 8.29 | 9.18 | 1 | 0 | 0 | 2 | 2 |
| 3.92 | 3.91 | 2.90 | 8.19 | 8.41 | 2 | 2 | 2 | 2 | 1 |
| 5.21 | 2.42 | 3.10 | 0.31 | 1.31 | 2 | 0 | 1 | 1 | 0 |
Here each output column in the train set has at least 2 (>= 0.15 * number of rows in the input data) and at most 3 (<= 0.3 * number of rows in the input data) instances of class 1. I guess I also didn't clarify that the proportion is relative to the number of examples (rows) in the input dataset, not in the train set. My test set would be the remaining rows of the input dataset.
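As a quick sanity check of the bounds described above, the class 1 counts of the expected train set can be tallied by hand. This is only an illustrative sketch: the names `train_targets`, `min_count` and `max_count` are made up for it, and the `round(..., 9)` call is an added guard against floating-point noise before rounding up or down.

```python
from math import ceil, floor

N_INPUT_ROWS = 10  # rows in the input dataset, which the proportions refer to

# OUTPUT_1..OUTPUT_5 of the seven rows in the expected train set above
train_targets = [
    (0, 1, 1, 0, 1),
    (0, 0, 0, 1, 2),
    (2, 1, 2, 2, 0),
    (1, 2, 0, 1, 2),
    (1, 0, 0, 2, 2),
    (2, 2, 2, 2, 1),
    (2, 0, 1, 1, 0),
]

# Convert the proportions into whole-row counts
min_count = ceil(round(0.15 * N_INPUT_ROWS, 9))   # 1.5 rounds up to 2
max_count = floor(round(0.30 * N_INPUT_ROWS, 9))  # 3.0 stays 3

# Number of class 1 entries in each output column
counts = [sum(row[k] == 1 for row in train_targets) for k in range(5)]
within_bounds = all(min_count <= c <= max_count for c in counts)

print(counts, within_bounds)  # [2, 2, 2, 3, 2] True
```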
I've formulated your problem as a Linear Programming Problem (LPP), strictly an integer one since each row is either in the train set or not, and solved it with scipy.optimize.linprog. The optimisation succeeded on both of the examples that you gave.
Your example dataset:
import numpy as np
import pandas as pd
np.random.seed(42)
data = pd.DataFrame({
    'A': np.random.rand(30),
    'B': np.random.rand(30),
    'C': np.random.rand(30),
    'D': np.random.rand(30),
    'E': np.random.rand(30),
    'OUTPUT_1': np.random.choice([0, 1, 2], 30),
    'OUTPUT_2': np.random.choice([0, 1, 2], 30),
    'OUTPUT_3': np.random.choice([0, 1, 2], 30),
    'OUTPUT_4': np.random.choice([0, 1, 2], 30),
    'OUTPUT_5': np.random.choice([0, 1, 2], 30)
})
Solution, which should be orders of magnitude faster than your original iterative approach. I've left the configurable values at the top so you can play around with them.
import math
from scipy.optimize import linprog
OUT_COLS = ["OUTPUT_1", "OUTPUT_2", "OUTPUT_3", "OUTPUT_4", "OUTPUT_5"]
LB_MULT = 0.15  # lower bound multiplier
UB_MULT = 0.3  # upper bound multiplier
TARGET_CLASS = 1  # the target class in OUT_COLS that you wish to redistribute
TRAIN_MULT = 0.7
total_len = len(data)
train_size = int(TRAIN_MULT * total_len)
min_class_1 = math.ceil(LB_MULT * total_len)
max_class_1 = int(UB_MULT * total_len)
for col in OUT_COLS:
    # count directly, so a column with no TARGET_CLASS rows gives 0 rather than a KeyError
    if (data[col] == TARGET_CLASS).sum() < min_class_1:
        print(f"solution infeasible due to insufficient class {TARGET_CLASS}s in col {col}")
# set to zero - feasibility over optimality
c = np.zeros(total_len)
# equals constraint
# sum of all rows must be train size (row = 0 if not in train, 1 otherwise)
A_eq, b_eq = np.ones((1, total_len)), [train_size]
# upper bound constraint
A_ub, b_ub = [], []
# each column must satisfy lb <= class 1 count <= ub
for col in OUT_COLS:
    class_1_indicator = (data[col] == TARGET_CLASS).astype(int)
    A_ub.append(-class_1_indicator)
    b_ub.append(-min_class_1)  # -x <= -lb equivalent to x >= lb
    A_ub.append(class_1_indicator)
    b_ub.append(max_class_1)  # x <= ub
A_ub, b_ub = np.vstack(A_ub), np.array(b_ub)
# binary - 1 in train, 0 not
x_bounds = [(0, 1) for _ in range(total_len)]
# LPP
result = linprog(
    c,
    A_ub=A_ub,
    b_ub=b_ub,
    A_eq=A_eq,
    b_eq=b_eq,
    bounds=x_bounds,
    # 1 - sets LPP to integer solutions only
    # (needs SciPy >= 1.9 and the default HiGHS method)
    integrality=np.ones(total_len),
)
print(result)
if result.success:
    # round before comparing: the solver returns floats, which may not be exactly 0 or 1
    in_train = np.round(result.x).astype(bool)
    train_data = data[in_train]
    test_data = data[~in_train]
    print(train_data)
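On a dataset as small as the 10-row example from the question, the same constraints can be cross-checked by plain enumeration. The sketch below is not part of the linprog solution: it is a standard-library brute force over all C(10, 7) = 120 candidate train sets, and the names `TARGETS`, `class_counts` and `brute_force_split` are invented for illustration. It assumes a 7-row train set and the bounds of 2 and 3 class 1 rows per column.

```python
from itertools import combinations

# OUTPUT_1..OUTPUT_5 of the 10-row example, one tuple per input row
TARGETS = [
    (0, 1, 1, 0, 1),
    (0, 0, 0, 1, 2),
    (2, 1, 2, 2, 0),
    (1, 2, 0, 1, 2),
    (0, 1, 1, 0, 1),
    (1, 0, 0, 2, 2),
    (0, 0, 0, 0, 0),
    (2, 2, 2, 2, 1),
    (1, 1, 2, 0, 2),
    (2, 0, 1, 1, 0),
]

def class_counts(rows, target_class=1):
    """Count target_class per output column over the given row indices."""
    n_cols = len(TARGETS[0])
    return [sum(TARGETS[i][k] == target_class for i in rows) for k in range(n_cols)]

def brute_force_split(train_size, lo, hi, target_class=1):
    """Return the first feasible tuple of train row indices, or None."""
    for rows in combinations(range(len(TARGETS)), train_size):
        if all(lo <= cnt <= hi for cnt in class_counts(rows, target_class)):
            return rows
    return None  # no subset satisfies every column: the problem is infeasible

train_rows = brute_force_split(train_size=7, lo=2, hi=3)
if train_rows is None:
    test_rows = None
else:
    test_rows = tuple(i for i in range(len(TARGETS)) if i not in train_rows)
print(train_rows, test_rows)
```

Unlike the random re-splitting loop, this always terminates and reports infeasibility by returning None, but the number of subsets grows combinatorially, which is why the integer program is the practical route for larger data.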
The explanation will assume familiarity with LPPs. The picture is from the scipy.optimize.linprog page.
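For reference, the general problem form that scipy.optimize.linprog accepts, followed by how the code above appears to map onto it. The symbols $n$ (number of rows), $K$ (number of output columns) and $a_{ik}$ are notation introduced here, not taken from the answer's code.

```latex
% General form solved by linprog
\begin{align*}
\min_{x}\ & c^{T} x \\
\text{such that}\ & A_{ub}\, x \le b_{ub}, \\
& A_{eq}\, x = b_{eq}, \\
& l \le x \le u .
\end{align*}

% This problem: x_i = 1 if row i goes to the train set, 0 otherwise;
% a_{ik} = 1 if row i has TARGET_CLASS in output column k, 0 otherwise.
\begin{align*}
\min_{x \in \{0,1\}^{n}}\ & 0 \\
\text{such that}\ & \sum_{i=1}^{n} x_i = \texttt{train\_size}, \\
& -\sum_{i=1}^{n} a_{ik}\, x_i \le -\texttt{min\_class\_1}, && k = 1, \dots, K, \\
& \sum_{i=1}^{n} a_{ik}\, x_i \le \texttt{max\_class\_1}, && k = 1, \dots, K .
\end{align*}
```

The objective is constant because any feasible point is acceptable; the lower bound on each column is negated so that it fits the $A_{ub}\, x \le b_{ub}$ form.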

Hope that helped!