I loaded two dataframes (testdf and datadf) from two files and then used
df = pd.concat([testdf, datadf])
which produces a df.shape of (48842, 15). So far so good.
Now I need an 80% train / 10% test / 10% validation split:
trndf = df.sample(frac=0.8)
This returns a shape of (39074, 15), which is correct (80% of 48842 rounds to 39074).
tmpdf = df.drop(trndf.index)
The idea here is to remove those 39074 rows from the df dataframe, which should leave a total of 9768. However, the tmpdf dataframe's shape is (4514, 15), losing 5254 rows.
df uses a default index numbered from 0 to 48841; a sample is below:
idx age work class
0 25 Private
1 28 Private
A sample of the trndf dataframe is below; it is a random sample, and I confirmed the index numbers match the index in the df dataframe:
idx age work class
228 25 ?
2164 35 State-gov
I'm open to ideas on how it managed to lose those extra rows. Any insight is appreciated. Thanks!
By default, pd.concat does not reset the indices, so if an index value exists in both testdf and datadf, the combined df contains that label twice. drop removes rows by label, and it removes every row carrying a given label; so whenever a duplicated label ends up in the sample, both rows with that label are dropped from df, and you lose an extra row for every sampled label that exists in both testdf and datadf.
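A quick way to confirm this is to check the combined index for duplicates. A minimal sketch, assuming testdf and datadf are loaded as in the question:

import pandas as pd

df = pd.concat([testdf, datadf])
print(df.index.is_unique)            # False if any label occurs more than once
print(df.index.duplicated().sum())   # how many rows repeat an earlier label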
Potential solutions are changing df = pd.concat([testdf,datadf]) to
df = pd.concat([testdf, datadf]).reset_index(drop=True)
or
df = pd.concat([testdf, datadf], ignore_index=True)
(Note the drop=True: a plain reset_index() would also give a unique index, but it moves the old index into a new column, changing the shape to (48842, 16).)
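With a unique index in place, the full 80/10/10 split works as intended. A minimal sketch, assuming testdf and datadf from the question (the random_state values are an added assumption, only for reproducibility):

import pandas as pd

df = pd.concat([testdf, datadf], ignore_index=True)  # unique labels 0..48841
trndf = df.sample(frac=0.8, random_state=42)         # 80% -> (39074, 15)
tmpdf = df.drop(trndf.index)                         # remaining 20% -> (9768, 15)
valdf = tmpdf.sample(frac=0.5, random_state=42)      # half the remainder -> 10%
tstdf = tmpdf.drop(valdf.index)                      # final 10% -> (4884, 15)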
Problem Reproduced:
import pandas as pd

df = pd.DataFrame({'a': {0: 0.6987303529918656,
                         1: -1.4637804486869905,
                         2: 0.4512092453413682,
                         3: 0.03898323021771516,
                         4: -0.143758037238284,
                         5: -1.6277278110578157}})
df_combined = pd.concat([df, df])  # duplicated index: labels 0..5 appear twice
print(df_combined)
print(df_combined.shape)
sample = df_combined.sample(frac=0.5)
print(sample.shape)
print(df_combined.drop(sample.index).shape)
a
0 0.698730
1 -1.463780
2 0.451209
3 0.038983
4 -0.143758
5 -1.627728
0 0.698730
1 -1.463780
2 0.451209
3 0.038983
4 -0.143758
5 -1.627728
(12, 1)  # print(df_combined.shape)
(6, 1)   # print(sample.shape)
(4, 1)   # print(df_combined.drop(sample.index).shape)
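For comparison, applying the ignore_index fix to the same reproduction keeps every row that was not sampled; a minimal sketch:

df_fixed = pd.concat([df, df], ignore_index=True)  # unique labels 0..11
sample = df_fixed.sample(frac=0.5)
print(df_fixed.drop(sample.index).shape)           # (6, 1): exactly half remains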