Pandas group or resample dataframe excluding columns

Question

import pandas as pd
import numpy as np

data = {
    'dateTimeGmt': {0: pd.Timestamp('2020-01-01 06:44:00'), 1: pd.Timestamp('2020-01-01 06:45:00'),
                    2: pd.Timestamp('2020-01-01 07:11:00'), 3: pd.Timestamp('2020-01-01 07:12:00'),
                    4: pd.Timestamp('2020-01-01 07:12:00'), 5: pd.Timestamp('2020-01-01 07:14:00'),
                    6: pd.Timestamp('2020-01-01 10:04:00'), 7: pd.Timestamp('2020-01-01 10:04:00'),
                    8: pd.Timestamp('2020-01-01 11:45:00'), 9: pd.Timestamp('2020-01-01 06:45:00')},
    'id': {0: 4, 1: 4, 2: 4, 3: 5, 4: 5, 5: 5, 6: 5, 7: 6, 8: 6, 9: 6},
    'name': {0: 'four', 1: 'four', 2: 'four', 3: 'five', 4: 'five',
             5: 'five', 6: 'five', 7: 'six', 8: 'six', 9: 'six'},
    'a': {0: 1.0, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan,
          5: np.nan, 6: np.nan, 7: 5.0, 8: np.nan, 9: np.nan},
    'b': {0: np.nan, 1: 3.0, 2: np.nan, 3: np.nan, 4: np.nan,
          5: np.nan, 6: np.nan, 7: np.nan, 8: np.nan, 9: 3.0},
    'c': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: 2.0,
          5: np.nan, 6: np.nan, 7: np.nan, 8: 0.0, 9: np.nan},
}
df = pd.DataFrame(data)

I would like to flatten my dataframe such that all columns after name are grouped by the hour in dateTimeGmt and then by id/name.

I tried:

df2 = df.groupby([df.dateTimeGmt.dt.date, df.dateTimeGmt.dt.hour, df.id, df.name]).sum()

This seems to work, but it moves all of my grouping columns into the index.

Adding as_index=False:

df3 = df.groupby([df.dateTimeGmt.dt.date, df.dateTimeGmt.dt.hour, df.id, df.name], as_index=False).sum()

keeps id and name, but the dateTimeGmt data is lost.

How do I group my data without losing the columns that have been grouped by?

jezrael · Accepted Answer

In your solution, you need to rename the date and hour Series so that the resulting column names are not duplicated (both would otherwise be called dateTimeGmt), and then finish with DataFrame.reset_index:

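# numeric_only=True skips the leftover dateTimeGmt column, which
# pandas 2.x refuses to sum (older versions dropped it silently)
df2 = (df.groupby([df.dateTimeGmt.dt.date.rename('date'),
                   df.dateTimeGmt.dt.hour.rename('h'), 'id', 'name'])
         .sum(numeric_only=True)
         .reset_index())
print(df2)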
         date   h  id  name    a    b    c
0  2020-01-01   6   4  four  1.0  3.0  0.0
1  2020-01-01   6   6   six  0.0  3.0  0.0
2  2020-01-01   7   4  four  0.0  0.0  0.0
3  2020-01-01   7   5  five  0.0  0.0  2.0
4  2020-01-01  10   5  five  0.0  0.0  0.0
5  2020-01-01  10   6   six  5.0  0.0  0.0
6  2020-01-01  11   6   six  0.0  0.0  0.0
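To see why the rename is needed, here is a minimal sketch on a hypothetical three-row frame (the data and the variable names date_key, hour_key, clash and renamed are made up for illustration). Both derived Series inherit the name dateTimeGmt, so the grouped index has two levels with the same name and reset_index cannot turn them into columns:

```python
import pandas as pd

# Hypothetical three-row frame, just large enough to show the name clash.
df = pd.DataFrame({
    "dateTimeGmt": pd.to_datetime(
        ["2020-01-01 06:44", "2020-01-01 06:45", "2020-01-01 07:11"]
    ),
    "id": [4, 4, 4],
    "a": [1.0, 2.0, 3.0],
})

date_key = df.dateTimeGmt.dt.date
hour_key = df.dateTimeGmt.dt.hour
key_names = [date_key.name, hour_key.name]  # both are 'dateTimeGmt'

# Without renaming, two index levels share one name ...
grouped = df.groupby([date_key, hour_key, "id"])[["a"]].sum()

# ... so reset_index fails when it tries to insert the second one.
try:
    grouped.reset_index()
    clash = None
except ValueError as exc:
    clash = str(exc)

# With distinct names, the levels become ordinary columns.
renamed = (
    df.groupby([date_key.rename("date"), hour_key.rename("h"), "id"])[["a"]]
      .sum()
      .reset_index()
)
print(key_names)
print(clash)
print(renamed)
```

Selecting [["a"]] before summing is one way to restrict the aggregation to the value columns; it is an illustrative choice here, not part of the original answer.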

Alternatively, use pd.Grouper with an hourly frequency, which keeps a single dateTimeGmt column floored to the hour:

# pandas 2.2 and later prefer the lowercase alias freq='h'
df2 = (df.groupby([pd.Grouper(freq='H', key='dateTimeGmt'), 'id', 'name'])
         .sum()
         .reset_index())
print(df2)
          dateTimeGmt  id  name    a    b    c
0 2020-01-01 06:00:00   4  four  1.0  3.0  0.0
1 2020-01-01 06:00:00   6   six  0.0  3.0  0.0
2 2020-01-01 07:00:00   4  four  0.0  0.0  0.0
3 2020-01-01 07:00:00   5  five  0.0  0.0  2.0
4 2020-01-01 10:00:00   5  five  0.0  0.0  0.0
5 2020-01-01 10:00:00   6   six  5.0  0.0  0.0
6 2020-01-01 11:00:00   6   six  0.0  0.0  0.0
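A related approach, sketched here as an assumption rather than as part of the original answer, is to floor the timestamps with Series.dt.floor and group on plain column names. That makes as_index=False keep every grouping column. The four-row frame and the names value_cols, hourly and hourly_nan are hypothetical. The sketch also shows that sum() reports 0.0 for groups containing only NaN (as in the outputs above), while min_count=1 leaves them as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same column layout as the question.
df = pd.DataFrame({
    "dateTimeGmt": pd.to_datetime([
        "2020-01-01 06:44", "2020-01-01 06:45",
        "2020-01-01 07:12", "2020-01-01 07:14",
    ]),
    "id": [4, 4, 5, 5],
    "name": ["four", "four", "five", "five"],
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [np.nan, 3.0, np.nan, np.nan],
})

value_cols = ["a", "b"]  # columns to aggregate

# Overwrite the timestamp with its hour bucket, then group on column names.
floored = df.assign(dateTimeGmt=df["dateTimeGmt"].dt.floor("h"))
by_hour = floored.groupby(["dateTimeGmt", "id", "name"], as_index=False)[value_cols]

hourly = by_hour.sum()                 # all-NaN groups become 0.0
hourly_nan = by_hour.sum(min_count=1)  # all-NaN groups stay NaN

print(hourly)
print(hourly_nan)
```

Whether 0.0 or NaN is the right result for an hour with no readings depends on what the columns mean, so treat min_count as an option to consider rather than a fix.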
