
Extract subset from pandas dataframes ensuring no overlap?

Suppose I have two pandas DataFrames: df with shape 297232 x 122 and df_raw with shape 840380 x 122. df is already a subset of df_raw, and both have a DateTime index. I would like to sample 70% of the rows from df and 30% of the rows from df_raw (randomly, if needed), while ensuring that the two sampled subsets do not overlap in their indexes.

To be more precise, df_subset will hold 70% of the rows of df, selected at random, and df_raw_subset will hold 30% of the rows of df_raw, selected at random. The two subsets must not share any sampled rows, i.e. no DateTime index value may appear in both.

JChat asked Sep 06 '25 03:09



1 Answer

First, sample from df. Because df is the smaller frame, dropping its sampled rows from the larger df_raw afterwards still leaves enough rows to draw the second sample from.

df_sub = df.sample(frac=0.7, replace=False)

Then drop the rows of df_raw whose index appears in df_sub, and sample from what remains. Here n is 30% of the original length of df_raw:

n = int(len(df_raw) * 0.3)
df_raw_sub = df_raw.drop(df_sub.index).sample(n, replace=False)
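
As a quick sanity check, the two steps above can be run end to end on small synthetic data. This is only a sketch under assumptions: the frames, the single "value" column, the daily index and the random_state seeds are hypothetical stand-ins for the real data, and the index is assumed to be unique.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with a unique DatetimeIndex (not the real frames).
idx = pd.date_range("2020-01-01", periods=1000, freq="D")
df_raw = pd.DataFrame({"value": np.arange(1000)}, index=idx)
df = df_raw.iloc[::3]  # df is a subset of df_raw

# Step 1: sample 70% of the rows of the smaller frame.
df_sub = df.sample(frac=0.7, replace=False, random_state=0)

# Step 2: drop those index labels from df_raw, then sample 30% of its
# original row count from the remaining rows.
n = int(len(df_raw) * 0.3)
df_raw_sub = df_raw.drop(df_sub.index).sample(n, replace=False, random_state=0)

# The two samples should share no index labels.
overlap = df_sub.index.intersection(df_raw_sub.index)
print(len(df_sub), len(df_raw_sub), len(overlap))
```

Note that drop removes rows by label, so if the DateTime index contains duplicate timestamps, every row sharing a sampled label is removed from df_raw, not just one. With replace=False, sample also raises an error if fewer than n rows remain after the drop.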
BENY answered Sep 07 '25 23:09
