
pandas 0.22: DataFrame.drop removes more rows than it should

I loaded two dataframes (testdf and datadf) from two files and then combined them:

df = pd.concat([testdf,datadf])

This produces a df.shape of (48842,15), so far so good.

Now I need 80% train, 10% test, 10% validation.

trndf = df.sample(frac=0.8) returns a shape of (39074,15), which is correct.

tmpdf = df.drop(trndf.index) The idea here is to remove those 39074 rows from df, which should leave a total of 9768. However, the shape of tmpdf is (4514,15), so 5254 rows have gone missing.

df uses a default index numbered from 0 to 48841. A sample is below:

idx   age  work class
0     25   Private
1     28   Private

A sample of trndf, which is a random sample, is below. I confirmed that its index numbers match index numbers in df.

idx   age  work class
228   25   ?
2164  35   State-gov

Open to ideas on how it managed to lose those extra rows. Appreciate any insight on this. Thanks

Richard Wheeler asked Oct 19 '25 03:10



1 Answer

By default, pd.concat does not reset the index, so the combined frame keeps the original row labels of testdf and datadf. Any label that exists in both frames therefore appears twice in df.

drop removes rows by label, not by position, so it drops every row carrying a label found in trndf.index. When a sampled row's label is shared by a row that was not sampled, that second row is dropped as well, which is why you lose more rows than you sampled.
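A minimal sketch of that behaviour, using two small hypothetical frames (`left` and `right` are illustrative names, not from the question) that both carry the default 0..2 index. It also shows `index.is_unique` as a quick way to check whether a frame is affected:

```python
import pandas as pd

# Two hypothetical frames; each gets the default index 0..2.
left = pd.DataFrame({"a": [10, 11, 12]})
right = pd.DataFrame({"a": [20, 21, 22]})

combined = pd.concat([left, right])          # index is now 0, 1, 2, 0, 1, 2
has_unique_index = combined.index.is_unique  # False: every label appears twice

picked = combined.iloc[[0]]                  # a single row, labelled 0
remaining = combined.drop(picked.index)      # drops BOTH rows labelled 0

print(has_unique_index)  # False
print(len(remaining))    # 4, not 5
```

Only one row was picked, yet two rows disappear, because the second frame also has a row labelled 0.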

Potential solutions are changing df = pd.concat([testdf,datadf]) to

df = pd.concat([testdf,datadf]).reset_index(drop=True)

(drop=True keeps the old index from being added as an extra column) or

df = pd.concat([testdf,datadf], ignore_index=True)
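With a unique index, the 80/10/10 split from the question could look roughly like the sketch below. The two frames here are small hypothetical stand-ins for testdf and datadf, the names `tstdf` and `valdf` are made up for illustration, and `random_state=0` is only there to make the run repeatable:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's testdf and datadf.
testdf = pd.DataFrame({"age": np.arange(40)})
datadf = pd.DataFrame({"age": np.arange(60)})

# ignore_index=True gives the combined frame a fresh, unique 0..n-1 index.
df = pd.concat([testdf, datadf], ignore_index=True)

trndf = df.sample(frac=0.8, random_state=0)     # 80% train
tmpdf = df.drop(trndf.index)                    # the remaining 20%
tstdf = tmpdf.sample(frac=0.5, random_state=0)  # half of the rest: 10% test
valdf = tmpdf.drop(tstdf.index)                 # what is left: 10% validation

print(trndf.shape, tstdf.shape, valdf.shape)    # (80, 1) (10, 1) (10, 1)
```

Because every label is now unique, each drop removes exactly the rows that were sampled and the three parts add up to the full frame.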

Problem Reproduced:

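import pandas as pd

df = pd.DataFrame({'a': {0: 0.6987303529918656,
  1: -1.4637804486869905,
  2: 0.4512092453413682,
  3: 0.03898323021771516,
  4: -0.143758037238284,
  5: -1.6277278110578157}})

# Concatenating without ignore_index keeps each of the labels 0..5 twice.
df_combined = pd.concat([df, df])
print(df_combined)
print(df_combined.shape)
# Randomly sample half of the 12 rows.
sample = df_combined.sample(frac=0.5)
print(sample.shape)
# drop() removes every row whose label appears in sample.index.
df_combined.drop(sample.index).shape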

          a
0  0.698730
1 -1.463780
2  0.451209
3  0.038983
4 -0.143758
5 -1.627728
0  0.698730
1 -1.463780
2  0.451209
3  0.038983
4 -0.143758
5 -1.627728
(12, 1) # print(df_combined.shape)
(6, 1)  # print(sample.shape)
Out[37]:
(4, 1)  # df_combined.drop(sample.index).shape; 6 rows expected, exact count varies with the random sample
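The number of rows that survive can be predicted rather than guessed. The sketch below is a variant of the reproduction with simpler, made-up values; the name `hit` and the fixed `random_state=0` are assumptions added for illustration. It counts the rows whose label occurs in the sample, which is exactly the set that drop removes:

```python
import pandas as pd

df = pd.DataFrame({"a": range(6)})
df_combined = pd.concat([df, df])  # labels 0..5, each appearing twice

sample = df_combined.sample(frac=0.5, random_state=0)  # 6 of the 12 rows

# Every row whose label occurs anywhere in the sample will be dropped.
hit = df_combined.index.isin(sample.index).sum()
remaining = df_combined.drop(sample.index)

print(len(df_combined) - hit == len(remaining))  # True
```

Since each label appears twice here, the remaining count is 12 minus twice the number of distinct labels in the sample, so it can never exceed the expected 6.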
Tai answered Oct 21 '25 17:10
