I have a dataset of ~2m observations which I need to split into training, validation and test sets in the ratio 60:20:20. A simplified excerpt of my dataset looks like this:
+---------+------------+-----------+-----------+
| note_id | subject_id | category  |   note    |
+---------+------------+-----------+-----------+
|       1 |          1 | ECG       | blah ...  |
|       2 |          1 | Discharge | blah ...  |
|       3 |          1 | Nursing   | blah ...  |
|       4 |          2 | Nursing   | blah ...  |
|       5 |          2 | Nursing   | blah ...  |
|       6 |          3 | ECG       | blah ...  |
+---------+------------+-----------+-----------+
There are multiple categories - which are not evenly balanced - so I need to ensure that the training, validation and test sets all have the same proportions of categories as in the original dataset. This part is fine: I can just use StratifiedShuffleSplit from the sklearn library.
However, I also need to ensure that the observations from each subject are not split across the training, validation and test datasets. All the observations from a given subject need to be in the same bucket to ensure my trained model has never seen the subject before when it comes to validation/testing. E.g. every observation of subject_id 1 should be in the training set.
I can't think of a way to ensure a stratified split by category, prevent contamination (for want of a better word) of subject_id across datasets, ensure a 60:20:20 split and ensure that the dataset is somehow shuffled. Any help would be appreciated!
Thanks!
EDIT:
I've now learnt that grouping by a column (here subject_id) and keeping each group within a single dataset split can also be accomplished in sklearn with the GroupShuffleSplit class. So essentially, what I need is a combined stratified and grouped shuffle split, i.e. a StratifiedGroupShuffleSplit, which does not exist. GitHub issue: https://github.com/scikit-learn/scikit-learn/issues/12076
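A minimal sketch of how GroupShuffleSplit could be chained to get a grouped (but not stratified) 60:20:20 split. The toy data and all variable names are hypothetical; only the column names come from the excerpt above. Note that test_size here is a fraction of the subjects, not of the rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 100 subjects with 5 notes each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subject_id": np.repeat(np.arange(100), 5),
    "category": rng.choice(["ECG", "Discharge", "Nursing"], size=500, p=[0.2, 0.1, 0.7]),
})

# Stage 1: hold out 20% of the subjects as the test set.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
rest_idx, test_idx = next(outer.split(df, groups=df["subject_id"]))
df_rest, df_test = df.iloc[rest_idx], df.iloc[test_idx]

# Stage 2: 25% of the remaining subjects (20% of all subjects) become validation.
inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(inner.split(df_rest, groups=df_rest["subject_id"]))
df_train, df_val = df_rest.iloc[train_idx], df_rest.iloc[val_idx]

# Collect the subject ids in each split to confirm nothing is shared.
train_ids = set(df_train["subject_id"])
val_ids = set(df_val["subject_id"])
test_ids = set(df_test["subject_id"])
print(len(train_ids), len(val_ids), len(test_ids))
```

Category proportions are left to chance in this sketch, which is exactly the gap the question is about.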
The simplest way to split the modelling dataset into training and testing sets is to assign two-thirds of the data points to the former and the remaining one-third to the latter. We then train the model on the training set and apply it to the test set. In this way, we can evaluate the performance of our model.
A stratified split can be achieved by setting the “stratify” argument of train_test_split() to the y component of the original dataset. The function uses it to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.
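As a small illustration of the "stratify" argument, here is a sketch on made-up, imbalanced labels (the arrays and the 80:20 ratio are assumptions for the example). It does not handle groups, so on its own it would not keep subjects together.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 samples with a 90:10 class imbalance.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

# stratify=y asks for the class proportions of y in both resulting sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Share of the minority class in each set; both should be close to 0.1.
train_share = y_train.mean()
test_share = y_test.mean()
print(train_share, test_share)
```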
Essentially I need a StratifiedGroupShuffleSplit, which does not exist (GitHub issue). This is because the behaviour of such a function is unclear, and producing a split that is both grouped and stratified is not always possible (also discussed here), especially with a heavily imbalanced dataset such as mine. In my case, I want grouping to be strict, so that there is no overlap of groups whatsoever, while stratification and the 60:20:20 split ratio only need to hold approximately, i.e. as well as is possible.
As Ghanem mentions, I have no choice but to build a function to split the dataset myself, which I have done below:
import numpy as np
import pandas as pd

def StratifiedGroupShuffleSplit(df_main):
    df_main = df_main.reindex(np.random.permutation(df_main.index)) # shuffle dataset
    # create empty train, val and test datasets
    df_train = pd.DataFrame()
    df_val = pd.DataFrame()
    df_test = pd.DataFrame()
    hparam_mse_wgt = 0.1 # must be between 0 and 1
    assert 0 <= hparam_mse_wgt <= 1
    train_proportion = 0.6 # must be between 0 and 1
    assert 0 <= train_proportion <= 1
    val_test_proportion = (1 - train_proportion) / 2
    subject_grouped_df_main = df_main.groupby('subject_id', sort=False, as_index=False)
    # percentage of records per category in the whole dataset
    category_grouped_df_main = df_main.groupby('category').count()[['subject_id']] / len(df_main) * 100
    def calc_mse_loss(df):
        # mean squared difference between the category percentages of df and of the whole dataset
        grouped_df = df.groupby('category').count()[['subject_id']] / len(df) * 100
        df_temp = category_grouped_df_main.join(grouped_df, on='category', how='left', lsuffix='_main')
        df_temp = df_temp.fillna(0) # categories missing from df count as 0%
        df_temp['diff'] = (df_temp['subject_id_main'] - df_temp['subject_id'])**2
        return np.mean(df_temp['diff'])
    i = 0
    for _, group in subject_grouped_df_main:
        # seed each dataset with one subject so the losses below are defined
        # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        if i == 0:
            df_train = pd.concat([df_train, group], ignore_index=True)
            i += 1
            continue
        elif i == 1:
            df_val = pd.concat([df_val, group], ignore_index=True)
            i += 1
            continue
        elif i == 2:
            df_test = pd.concat([df_test, group], ignore_index=True)
            i += 1
            continue
        # reduction in category mismatch if this subject were added to each dataset
        mse_loss_diff_train = calc_mse_loss(df_train) - calc_mse_loss(pd.concat([df_train, group], ignore_index=True))
        mse_loss_diff_val = calc_mse_loss(df_val) - calc_mse_loss(pd.concat([df_val, group], ignore_index=True))
        mse_loss_diff_test = calc_mse_loss(df_test) - calc_mse_loss(pd.concat([df_test, group], ignore_index=True))
        # shortfall of each dataset against its target share (positive means under target)
        total_records = len(df_train) + len(df_val) + len(df_test)
        len_diff_train = train_proportion - len(df_train) / total_records
        len_diff_val = val_test_proportion - len(df_val) / total_records
        len_diff_test = val_test_proportion - len(df_test) / total_records
        # signed square keeps the direction of the shortfall
        len_loss_diff_train = len_diff_train * abs(len_diff_train)
        len_loss_diff_val = len_diff_val * abs(len_diff_val)
        len_loss_diff_test = len_diff_test * abs(len_diff_test)
        loss_train = (hparam_mse_wgt * mse_loss_diff_train) + ((1 - hparam_mse_wgt) * len_loss_diff_train)
        loss_val = (hparam_mse_wgt * mse_loss_diff_val) + ((1 - hparam_mse_wgt) * len_loss_diff_val)
        loss_test = (hparam_mse_wgt * mse_loss_diff_test) + ((1 - hparam_mse_wgt) * len_loss_diff_test)
        # assign the subject to the dataset with the highest loss value
        if max(loss_train, loss_val, loss_test) == loss_train:
            df_train = pd.concat([df_train, group], ignore_index=True)
        elif max(loss_train, loss_val, loss_test) == loss_val:
            df_val = pd.concat([df_val, group], ignore_index=True)
        else:
            df_test = pd.concat([df_test, group], ignore_index=True)
        print(f"Group {i}. loss_train: {loss_train} | loss_val: {loss_val} | loss_test: {loss_test} | ")
        i += 1
    return df_train, df_val, df_test

df_train, df_val, df_test = StratifiedGroupShuffleSplit(df_main)
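I have created a somewhat arbitrary loss function based on 2 things: the mean squared difference between each category's percentage share in a dataset and its share in the overall dataset, and the (signed) squared difference between a dataset's current share of the records and the share it should have according to the ratio supplied (60:20:20).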
Weighting these two inputs to the loss function is done by the static hyperparameter hparam_mse_wgt. For my particular dataset, a value of 0.1 worked well but I would encourage you to play around with it if you use this function. Setting it to 0 will prioritise only maintaining the split ratio and ignore the stratification. Setting it to 1 would be vice versa.
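Using this loss function, I then iterate through each subject (group) and append it to whichever dataset (training, validation or test) has the highest loss value. In effect the value is a score: it is high when adding the subject would bring the dataset's category mix closer to the overall mix, or when the dataset is still below its target size.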
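To make the size term concrete, here is a small worked sketch with made-up record counts (the counts and the helper name signed_square are hypothetical; the formula mirrors len_diff * abs(len_diff) from the function above).

```python
def signed_square(x):
    # Square the shortfall but keep its sign, as in len_diff * abs(len_diff).
    return x * abs(x)

# Hypothetical state part-way through the loop: 100 records assigned so far.
sizes = {"train": 50, "val": 30, "test": 20}
targets = {"train": 0.6, "val": 0.2, "test": 0.2}
total = sum(sizes.values())

# Positive: the dataset is under its target share. Negative: it is over.
size_terms = {name: signed_square(targets[name] - sizes[name] / total) for name in sizes}
most_under_target = max(size_terms, key=size_terms.get)
print(size_terms, most_under_target)
```

With hparam_mse_wgt set to 0 only this term matters, so the next subject would go to the dataset that is furthest below its target share.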
It's not particularly complicated but it does the job for me. It won't necessarily work for every dataset, but the larger it is, the better the chance. Hopefully someone else will find it useful.
This question is more than a year old, but I found myself in a similar situation where I have labels and groups, and due to the nature of the groups, all data points of a group must be either only in test or only in train. I've written a small algorithm using pandas and sklearn; I hope it helps.
from sklearn.model_selection import GroupShuffleSplit

# df is assumed to be a DataFrame with a 'label' column (class) and a 'groups' column (group id)
valid_size = 0.2 # fraction of groups held out for test; assumed value, not given in the original

groups = df.groupby('label')
all_train = []
all_test = []
for group_id, group in groups:
    # if a group is already taken in test or train it must stay there
    group = group[~group['groups'].isin(all_train + all_test)]
    # if group is empty
    if group.shape[0] == 0:
        continue
    # note: GroupShuffleSplit raises a ValueError if only one group remains for this label
    train_inds, test_inds = next(GroupShuffleSplit(
        test_size=valid_size, n_splits=1, random_state=7).split(group, groups=group['groups']))
    all_train += group.iloc[train_inds]['groups'].tolist()
    all_test += group.iloc[test_inds]['groups'].tolist()

train = df[df['groups'].isin(all_train)]
test = df[df['groups'].isin(all_test)]

# check that no group ended up in both sets
form_train = set(train['groups'].tolist())
form_test = set(test['groups'].tolist())
inter = form_train.intersection(form_test)

# compare label counts in the full data, train and test
print(df.groupby('label').count())
print(train.groupby('label').count())
print(test.groupby('label').count())
print(inter) # this should be empty