Is there a way to keep rows with a specific condition, and remove others if this condition is not met?

Question

I have the following dataframe (df)

                ID  start                        end                     Diff
                A   1/8/2020 12:00:00 AM         1/8/2020 12:00:10 AM    10
                A   1/8/2020 12:00:05 AM         1/8/2020 12:00:10 AM    5
                B   1/9/2020 1:00:05 AM          1/9/2020 1:00:10 AM     5
                B   1/9/2020 1:00:06 AM          1/9/2020 1:00:10 AM     4
                B   1/9/2020 1:00:20 AM          1/9/2020 1:00:25 AM     5
                C   1/10/2020 5:00:05 AM         1/10/2020 5:00:25 AM    20
                C   1/10/2020 5:00:40 AM         1/10/2020 5:00:45 AM    5

Here is the problem: When the end times are identical, I would like to delete the row with the longer duration, deleting the row with the shortest duration.

Desired outcome:

                ID  start                      end                     Diff
                A   1/8/2020 12:00:05 AM       1/8/2020 12:00:10 AM    5
                B   1/9/2020 1:00:06 AM        1/9/2020 1:00:10 AM     4
                B   1/9/2020 1:00:20 AM        1/9/2020 1:00:25 AM     5
                C   1/10/2020 5:00:05 AM       1/10/2020 5:00:25 AM    20
                C   1/10/2020 5:00:40 AM       1/10/2020 5:00:45 AM    5

Essentially, when the end time is identical, I wish to remove the row that has the longer duration. I have tried this, however, it is not taking into account the condition: Retain the shorter duration row when the end time is identical

                df.sort_values(['Diff']).drop_duplicates(subset=['ID'])

Any suggestion(s) is appreciated.

anky · Accepted Answer

Use groupby on end column transformed with the minimum value of Diff , then compare with df['Diff'] and keep those which return True, check how transform returns the minimum over an entire group below:

df[df['Diff'].eq(df.groupby('end')['Diff'].transform('min'))]

  ID                 start                   end  Diff
1  A  1/8/2020 12:00:05 AM  1/8/2020 12:00:10 AM     5
3  B   1/9/2020 1:00:06 AM   1/9/2020 1:00:10 AM     4
4  B   1/9/2020 1:00:20 AM   1/9/2020 1:00:25 AM     5
5  C  1/10/2020 5:00:05 AM  1/10/2020 5:00:25 AM    20
6  C  1/10/2020 5:00:40 AM  1/10/2020 5:00:45 AM     5

Output of groupby+transform

print(df.groupby('end')['Diff'].transform('min'))

0     5
1     5
2     4
3     4
4     5
5    20
6     5

print(df['Diff'].eq(df.groupby('end')['Diff'].transform('min')))

0    False
1     True
2    False
3     True
4     True
5     True
6     True

ansev · Answer

We could use Series.map

df[df['Diff'].eq(df['end'].map(df.groupby('end')['Diff'].min()))]

  ID                 start                   end  Diff
1  A  1/8/2020 12:00:05 AM  1/8/2020 12:00:10 AM     5
3  B   1/9/2020 1:00:06 AM   1/9/2020 1:00:10 AM     4
4  B   1/9/2020 1:00:20 AM   1/9/2020 1:00:25 AM     5
5  C  1/10/2020 5:00:05 AM  1/10/2020 5:00:25 AM    20
6  C  1/10/2020 5:00:40 AM  1/10/2020 5:00:45 AM     5

Is there a way to keep rows with a specific condition, and remove others if this condition is not met?

Tags:

python

pandas

dataframe

duplicates

Lynn

2 Answers

anky

ansev

Recent Activity

Donate For Us

Is there a way to keep rows with a specific condition, and remove others if this condition is not met?

Tags:

python

pandas

dataframe

duplicates

Lynn

2 Answers

anky

ansev

Related questions

Recent Activity

Donate For Us