pandas: grouping and aggregating consecutive rows with the same value in a column

Question

I have a pandas DataFrame built from a long list of datetime ranges pulled from a database, each range with a label. The rows are ordered so that the start date of each row is the end date of the row before it. A working example:

import pandas as pd

bins = [{'start': '2020-01-12 00:00:00', 'end': '2020-01-13 00:00:00', 'label': 't3'},
        {'start': '2020-01-13 00:00:00', 'end': '2020-01-13 07:00:00', 'label': 't2'},
        {'start': '2020-01-13 07:00:00', 'end': '2020-01-13 15:30:00', 'label': 't1'},
        {'start': '2020-01-13 15:30:00', 'end': '2020-01-14 00:00:00', 'label': 't2'},
        {'start': '2020-01-14 00:00:00', 'end': '2020-01-14 07:00:00', 'label': 't2'},
        {'start': '2020-01-14 07:00:00', 'end': '2020-01-14 15:30:00', 'label': 't1'},
        {'start': '2020-01-14 15:30:00', 'end': '2020-01-15 00:00:00', 'label': 't2'},
        {'start': '2020-01-15 00:00:00', 'end': '2020-01-15 07:00:00', 'label': 't2'},
        {'start': '2020-01-15 07:00:00', 'end': '2020-01-15 15:30:00', 'label': 't1'},
        {'start': '2020-01-15 15:30:00', 'end': '2020-01-16 00:00:00', 'label': 't2'},
        {'start': '2020-01-16 00:00:00', 'end': '2020-01-16 07:00:00', 'label': 't2'},
        {'start': '2020-01-16 07:00:00', 'end': '2020-01-16 15:30:00', 'label': 't1'},
        {'start': '2020-01-16 15:30:00', 'end': '2020-01-17 00:00:00', 'label': 't2'},
        {'start': '2020-01-17 00:00:00', 'end': '2020-01-17 07:00:00', 'label': 't2'},
        {'start': '2020-01-17 07:00:00', 'end': '2020-01-17 15:30:00', 'label': 't1'},
        {'start': '2020-01-17 15:30:00', 'end': '2020-01-18 00:00:00', 'label': 't2'},
        {'start': '2020-01-18 00:00:00', 'end': '2020-01-19 00:00:00', 'label': 't2'}]
bins_df = pd.DataFrame(bins)

Notice that some labels repeat consecutively; for example, the 4th and 5th rows have the same label. Thus, the label 't2' applies to the whole range from 2020-01-13 15:30:00 to 2020-01-14 07:00:00. Using pandas, how can I group consecutive rows with the same label and take the minimum start and the maximum end, so that consecutive date ranges with the same label are combined into one?

Erfan · Accepted Answer

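First we use Series.shift with Series.cumsum to make a group indicator for each run of consecutive equal labels: comparing each label with the one in the previous row flags where a new run starts, and the cumulative sum of those flags gives every run its own number.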
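To make the indicator step concrete, here is a minimal sketch on a toy Series. The data and the names `labels`, `changed` and `group_id` are made up for illustration and are not part of the question.

```python
import pandas as pd

# Toy label column (hypothetical data, not taken from the question)
labels = pd.Series(['t3', 't2', 't1', 't2', 't2', 't1'])

# True wherever a label differs from the label in the row above.
# The first row is compared against NaN, so it is always True.
changed = labels.ne(labels.shift())

# Running total of the change flags: each run of equal labels shares one id
group_id = changed.cumsum()

print(group_id.tolist())  # [1, 2, 3, 4, 4, 5]
```

The two adjacent 't2' rows share the id 4, while the earlier, non-adjacent 't2' row keeps its own id 2, which is what separates this from a plain groupby on the label.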

Then we group by that indicator and use groupby.agg to take the min of start, the max of end, and the first label in each group.

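# Group id that increases by one every time the label changes from the previous row
label_groups = bins_df['label'].ne(bins_df['label'].shift()).cumsum()

# Collapse each run of equal labels into one row: earliest start, latest end
df = (
    bins_df.groupby(label_groups).agg({'start':'min', 'end':'max', 'label':'first'})
           .reset_index(drop=True)
)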

                 start                 end label
0  2020-01-12 00:00:00 2020-01-13 00:00:00    t3
1  2020-01-13 00:00:00 2020-01-13 07:00:00    t2
2  2020-01-13 07:00:00 2020-01-13 15:30:00    t1
3  2020-01-13 15:30:00 2020-01-14 07:00:00    t2
4  2020-01-14 07:00:00 2020-01-14 15:30:00    t1
5  2020-01-14 15:30:00 2020-01-15 07:00:00    t2
6  2020-01-15 07:00:00 2020-01-15 15:30:00    t1
7  2020-01-15 15:30:00 2020-01-16 07:00:00    t2
8  2020-01-16 07:00:00 2020-01-16 15:30:00    t1
9  2020-01-16 15:30:00 2020-01-17 07:00:00    t2
10 2020-01-17 07:00:00 2020-01-17 15:30:00    t1
11 2020-01-17 15:30:00 2020-01-19 00:00:00    t2
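One caveat: in the question's data, start and end are strings. min and max still give the right result there because that fixed-width, year-first format sorts the same way as text and as time. If the format might vary, parsing the columns with pd.to_datetime first is safer. A minimal sketch of that variant, using a shortened copy of the question's rows and named aggregation (the name `merged` is only illustrative):

```python
import pandas as pd

# Shortened copy of the question's data (rows 3 to 6 of the original list)
bins_df = pd.DataFrame([
    {'start': '2020-01-13 07:00:00', 'end': '2020-01-13 15:30:00', 'label': 't1'},
    {'start': '2020-01-13 15:30:00', 'end': '2020-01-14 00:00:00', 'label': 't2'},
    {'start': '2020-01-14 00:00:00', 'end': '2020-01-14 07:00:00', 'label': 't2'},
    {'start': '2020-01-14 07:00:00', 'end': '2020-01-14 15:30:00', 'label': 't1'},
])

# Parse the strings so that min/max compare real timestamps, not text
bins_df['start'] = pd.to_datetime(bins_df['start'])
bins_df['end'] = pd.to_datetime(bins_df['end'])

# Same group indicator as in the answer
label_groups = bins_df['label'].ne(bins_df['label'].shift()).cumsum()

# Named aggregation: output column = (input column, aggregation function)
merged = (
    bins_df.groupby(label_groups)
           .agg(start=('start', 'min'), end=('end', 'max'), label=('label', 'first'))
           .reset_index(drop=True)
)

print(merged)
```

Here the two consecutive 't2' rows collapse into a single range from 2020-01-13 15:30:00 to 2020-01-14 07:00:00, leaving three rows in total.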

Tags: pandas, dataframe, aggregation, django, pandas-groupby

MarkD
