Pandas 0.22: DataFrame.drop removes more rows than it should

Question

I loaded two dataframes (testdf and datadf) from two files and then combined them:

df = pd.concat([testdf,datadf])

This produces a df.shape of (48842,15). So far so good.

Now I need 80% train, 10% test and 10% validation.

trndf = df.sample(frac=0.8)

This returns a shape of (39074,15), which is correct.

tmpdf = df.drop(trndf.index)

The idea here is to remove those 39074 rows from the df dataframe, which should leave a total of 9768. However, the tmpdf dataframe shape is (4514,15), so it is losing 5254 rows more than expected.

df uses a default index, which is numbered from 0 to 48841. A sample is below:

idx  age  work class
0    25   Private
1    28   Private

A sample of the trndf dataframe is below. It is a random sample, and I confirmed that the index numbers match the index in the df dataframe:

idx   age  work class
228   25   ?
2164  35   State-gov

Open to ideas on how it managed to lose those extra rows. Appreciate any insight on this. Thanks

Tai · Accepted Answer

By default, pd.concat does not reset the index. If an index label exists in both testdf and datadf, the concatenated frame contains that label twice, and both rows are dropped together whenever that label is sampled.

drop works by label and removes every row that carries a matching label, so you lose extra rows for each sampled label that exists in both testdf and datadf.
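One way to confirm this diagnosis is to check whether the concatenated index contains repeated labels. The following is a minimal sketch; the two small frames are made-up stand-ins for testdf and datadf, not the original files.

```python
import pandas as pd

# Hypothetical stand-ins for the two loaded files; each has its own 0..n-1 index.
testdf = pd.DataFrame({"age": [25, 28, 35]})
datadf = pd.DataFrame({"age": [41, 19, 52, 33]})

df = pd.concat([testdf, datadf])

# Labels 0, 1 and 2 appear in both frames, so the combined index is not unique.
index_is_unique = df.index.is_unique
n_repeated = int(df.index.duplicated().sum())
print(index_is_unique, n_repeated)

# drop() works by label, so dropping label 0 removes both rows labelled 0.
remaining = df.drop([0])
print(remaining.shape)
```

If is_unique is False on the real df, the extra lost rows are explained by repeated labels.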

Potential solutions are changing df = pd.concat([testdf,datadf]) to

df = pd.concat([testdf,datadf]).reset_index(drop=True)

or

df = pd.concat([testdf,datadf], ignore_index=True)
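With a unique index in place, the 80/10/10 split from the question could look roughly like the sketch below. The frames are small made-up stand-ins, the names tstdf and valdf are hypothetical, and random_state is only there to make the example repeatable.

```python
import pandas as pd

# Hypothetical stand-ins for the two loaded files (100 rows in total).
testdf = pd.DataFrame({"age": range(40)})
datadf = pd.DataFrame({"age": range(60)})

# ignore_index=True gives the combined frame a fresh, unique 0..n-1 index.
df = pd.concat([testdf, datadf], ignore_index=True)

# 80% train, then drop those labels to keep the remaining 20%.
trndf = df.sample(frac=0.8, random_state=0)
tmpdf = df.drop(trndf.index)

# Split the remainder in half: 10% test, 10% validation.
tstdf = tmpdf.sample(frac=0.5, random_state=0)
valdf = tmpdf.drop(tstdf.index)

print(trndf.shape, tstdf.shape, valdf.shape)
```

Because every label is now unique, each drop removes exactly the rows that were sampled and the three parts do not overlap.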

Problem Reproduced:

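import pandas as pd

# Small frame with the default index 0..5
df = pd.DataFrame({'a': {0: 0.6987303529918656,
  1: -1.4637804486869905,
  2: 0.4512092453413682,
  3: 0.03898323021771516,
  4: -0.143758037238284,
  5: -1.6277278110578157}})

# Concatenating without resetting the index repeats every label 0..5
df_combined = pd.concat([df, df])
print(df_combined)
print(df_combined.shape)
sample = df_combined.sample(frac=0.5)
print(sample.shape)
# drop() removes every row whose label was sampled, including its duplicate
print(df_combined.drop(sample.index).shape)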

          a
0  0.698730
1 -1.463780
2  0.451209
3  0.038983
4 -0.143758
5 -1.627728
0  0.698730
1 -1.463780
2  0.451209
3  0.038983
4 -0.143758
5 -1.627728
(12, 1) # print(df_combined.shape)
(6, 1)  # print(sample.shape)
(4, 1)  # df_combined.drop(sample.index).shape; the exact count depends on the random sample
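For comparison, here is a sketch of the same reproduction with ignore_index=True. The values are rounded from the frame above, and the name rest is hypothetical.

```python
import pandas as pd

# Same six values as the reproduction, rounded for brevity.
df = pd.DataFrame({"a": [0.70, -1.46, 0.45, 0.04, -0.14, -1.63]})

# Same concatenation, but with a fresh unique index 0..11.
df_combined = pd.concat([df, df], ignore_index=True)
sample = df_combined.sample(frac=0.5)
rest = df_combined.drop(sample.index)

print(df_combined.shape)  # (12, 1)
print(sample.shape)       # (6, 1)
print(rest.shape)         # (6, 1)
```

Here the sampled and remaining rows always add up to the full 12, whichever rows are drawn.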

Tags: python, pandas, dataframe

Richard Wheeler
