This is my DataFrame:
dt value
2000-01-01 00:00:00 296.0
2000-01-01 00:05:00 296.0
2000-01-01 00:10:00 296.0
2000-01-01 00:15:00 296.25
2000-01-01 00:20:00 296.25
2000-01-01 00:25:00 296.25
2000-01-01 00:30:00 296.25
2000-01-01 00:35:00 296.25
2000-01-01 00:40:00 296.25
2000-01-01 00:45:00 296.5
2000-01-01 00:50:00 296.5
2000-01-01 00:55:00 296.5
2000-01-01 01:00:00 296.5
2000-01-01 01:05:00 296.5
2000-01-01 01:10:00 296.5
2000-01-01 01:15:00 296.75
2000-01-01 01:20:00 296.75
2000-01-01 01:50:00 297.0
2000-01-01 01:55:00 297.0
2000-01-01 02:00:00 297.0
2000-01-01 02:05:00 297.0
2000-01-01 02:10:00 297.0
2000-01-01 02:15:00 297.0
I want to remove adjacent duplicate values, keeping only the middle row of each run.
If the run length is even, take the later of the two middle rows; e.g. for a run of 2, keep the second.
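In other words, for a run of n adjacent duplicates I want to keep the row at 0-based position n // 2 within the run, for example:
n = 6   # run of six duplicates, e.g. the 296.25 block
n // 2  # -> 3, i.e. the fourth row (00:30:00)
n = 2   # run of two duplicates, e.g. the 296.75 block
n // 2  # -> 1, i.e. the second row (01:20:00)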
Expected output:
dt value
2000-01-01 00:05:00 296.0
2000-01-01 00:30:00 296.25
2000-01-01 01:00:00 296.5
2000-01-01 01:20:00 296.75
2000-01-01 02:05:00 297.0
I read this post about dropping duplicates, but it doesn't satisfy my condition of keeping the middle element:
Pandas: Drop consecutive duplicates
Use -
df.groupby(['value'])['dt'].apply(lambda x: x.iloc[math.floor(len(x)/2)])
Output
value
296.00 2000-01-01 00:05:00
296.25 2000-01-01 00:30:00
296.50 2000-01-01 01:00:00
296.75 2000-01-01 01:20:00
297.00 2000-01-01 02:05:00
Name: dt, dtype: object
Needless to say, you have to import math
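A caveat: groupby('value') groups all rows that share a value, not just adjacent ones. That works here because each value happens to form a single consecutive run; if the same value can reappear later in the series, label the consecutive runs first and group on that instead (a sketch of the same idea):
import math
runs = df['value'].ne(df['value'].shift()).cumsum()  # new id whenever the value changes
df.groupby(runs)['dt'].apply(lambda x: x.iloc[math.floor(len(x) / 2)])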
Timings
@ALollz's answer:
3.82 ms ± 442 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This answer:
2.09 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To avoid the slower groupby + apply(lambda ...), we can instead use the built-in methods implemented in Cython. Form groups of consecutive values by checking against the shifted Series, then aggregate those groups with size and cumcount to subset the original DataFrame. (Bonus: this preserves the original DataFrame's index.)
# Label each run of consecutive equal values with an integer id
gp = df['value'].ne(df['value'].shift()).cumsum()
gp = gp.groupby(gp)

# Keep the row whose position within its run equals half the run's size
df[gp.transform('size')//2 == gp.cumcount()]
dt value
1 2000-01-01 00:05:00 296.00
6 2000-01-01 00:30:00 296.25
12 2000-01-01 01:00:00 296.50
16 2000-01-01 01:20:00 296.75
20 2000-01-01 02:05:00 297.00
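To see why this works, here is the intermediate grouper on the sample data; the very first comparison is True (296.0 vs. the shifted-in NaN), so run numbering starts at 1:
df['value'].ne(df['value'].shift()).cumsum().tolist()
# [1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5]
transform('size') // 2 then gives the target position within each run (1, 3, 3, 1, 3), and cumcount() numbers each row within its run, so the equality picks out exactly the middle rows shown above.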
To illustrate the relative performance, here's an example. I create a test DataFrame where each value is repeated twice (just for simplicity), so we can see how the two methods scale as a function of the number of groups (the number that matters for the scaling of groupby + apply).
import perfplot
import pandas as pd
import numpy as np

def groupby_apply(df):
    gp = df['value'].ne(df['value'].shift()).cumsum()
    return df.groupby(gp).apply(lambda x: x.iloc[len(x) // 2])

def groupby_builtin(df):
    gp = df['value'].ne(df['value'].shift()).cumsum()
    gp = gp.groupby(gp)
    return df[gp.transform('size')//2 == gp.cumcount()]

perfplot.show(
    setup=lambda N: pd.DataFrame({'dt': range(N),
                                  'value': np.repeat(np.arange(N//2), 2)}),
    kernels=[
        lambda df: groupby_apply(df),
        lambda df: groupby_builtin(df)],
    labels=['apply', 'built-in'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,
    xlabel='Proxy for # of Groups'
)
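(perfplot isn't bundled with pandas; pip install perfplot if you want to reproduce the plot. Since every value in the test frame is repeated exactly twice, N/2 is the number of groups on the x-axis.)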
