import pandas as pd
import numpy as np
data = {'dateTimeGmt': {0: pd.Timestamp('2020-01-01 06:44:00'),
1: pd.Timestamp('2020-01-01 06:45:00'), 2: pd.Timestamp('2020-01-01 07:11:00'), 3: pd.Timestamp('2020-01-01 07:12:00'), 4: pd.Timestamp('2020-01-01 07:12:00'), 5: pd.Timestamp('2020-01-01 07:14:00'), 6: pd.Timestamp('2020-01-01 10:04:00'), 7: pd.Timestamp('2020-01-01 10:04:00'), 8: pd.Timestamp('2020-01-01 11:45:00'), 9: pd.Timestamp('2020-01-01 06:45:00')},
'id': {0: 4, 1: 4, 2: 4, 3: 5, 4: 5, 5: 5, 6: 5, 7: 6, 8: 6, 9: 6},
'name': {0: 'four', 1: 'four', 2: 'four', 3: 'five', 4: 'five', 5: 'five', 6: 'five', 7: 'six', 8: 'six', 9: 'six'}, 'a': {0: 1.0, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan, 7: 5.0, 8: np.nan, 9: np.nan}, 'b': {0: np.nan, 1: 3.0, 2: np.nan, 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan, 7: np.nan, 8: np.nan, 9: 3.0}, 'c': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: 2.0, 5: np.nan, 6: np.nan, 7: np.nan, 8: 0.0, 9: np.nan}}
df = pd.DataFrame(data)
I would like to flatten my dataframe such that all columns after name are grouped by the hour in dateTimeGmt and then by id/name.
I tried df2 = df.groupby([df.dateTimeGmt.dt.date, df.dateTimeGmt.dt.hour, df.id, df.name]).sum(). This seems to work, but it moves all of my grouping columns into the index.
df3 = df.groupby([df.dateTimeGmt.dt.date, df.dateTimeGmt.dt.hour, df.id, df.name], as_index = False).sum() keeps id and name, but the dateTimeGmt data is lost.
How do I group my data without losing the columns that have been grouped by?
In your solution, you need to rename the date and hour Series, because both are otherwise named dateTimeGmt and would produce duplicate column names. Then call DataFrame.reset_index at the end:
df2 = (df.groupby([df.dateTimeGmt.dt.date.rename('date'),
df.dateTimeGmt.dt.hour.rename('h'), 'id', 'name'])
.sum()
.reset_index())
print (df2)
         date   h  id  name    a    b    c
0  2020-01-01   6   4  four  1.0  3.0  0.0
1  2020-01-01   6   6   six  0.0  3.0  0.0
2  2020-01-01   7   4  four  0.0  0.0  0.0
3  2020-01-01   7   5  five  0.0  0.0  2.0
4  2020-01-01  10   5  five  0.0  0.0  0.0
5  2020-01-01  10   6   six  5.0  0.0  0.0
6  2020-01-01  11   6   six  0.0  0.0  0.0
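A variant of the same idea, as a minimal sketch: it selects the value columns a, b and c explicitly before summing. This is an assumption about what you want aggregated. On newer pandas releases, summing a frame that still contains the raw dateTimeGmt column may raise an error instead of silently dropping it, so naming the columns is the safer choice. The small frame below is an illustrative stand-in, not the full data from the question.

```python
import numpy as np
import pandas as pd

# Small illustrative stand-in for the question's DataFrame.
df = pd.DataFrame({
    'dateTimeGmt': pd.to_datetime(['2020-01-01 06:44', '2020-01-01 06:45',
                                   '2020-01-01 07:11', '2020-01-01 07:12']),
    'id': [4, 4, 4, 5],
    'name': ['four', 'four', 'four', 'five'],
    'a': [1.0, np.nan, np.nan, np.nan],
    'b': [np.nan, 3.0, np.nan, np.nan],
    'c': [np.nan, np.nan, np.nan, 2.0],
})

# Rename the derived Series so the two keys get distinct names.
keys = [df['dateTimeGmt'].dt.date.rename('date'),
        df['dateTimeGmt'].dt.hour.rename('h'),
        'id', 'name']

# Sum only the value columns, then move the keys back into columns.
out = df.groupby(keys)[['a', 'b', 'c']].sum().reset_index()
print(out)
```

The result has date, h, id and name as ordinary columns, followed by the summed a, b and c.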
Alternatively, you can use pd.Grouper with an hourly frequency:
df2 = df.groupby([pd.Grouper(freq='H', key='dateTimeGmt'), 'id', 'name']).sum().reset_index()
print (df2)
          dateTimeGmt  id  name    a    b    c
0 2020-01-01 06:00:00   4  four  1.0  3.0  0.0
1 2020-01-01 06:00:00   6   six  0.0  3.0  0.0
2 2020-01-01 07:00:00   4  four  0.0  0.0  0.0
3 2020-01-01 07:00:00   5  five  0.0  0.0  2.0
4 2020-01-01 10:00:00   5  five  0.0  0.0  0.0
5 2020-01-01 10:00:00   6   six  5.0  0.0  0.0
6 2020-01-01 11:00:00   6   six  0.0  0.0  0.0
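The Grouper approach can be sketched the same way on a small illustrative frame (not the full data from the question). Two things here are assumptions on my side: the value columns are selected explicitly, and the frequency is written as lowercase 'h', since recent pandas releases warn about the uppercase 'H' alias. The hour bucket stays a real timestamp column, so you can still split it with .dt.date or .dt.hour afterwards if you need to.

```python
import numpy as np
import pandas as pd

# Small illustrative stand-in for the question's DataFrame.
df = pd.DataFrame({
    'dateTimeGmt': pd.to_datetime(['2020-01-01 06:44', '2020-01-01 06:45',
                                   '2020-01-01 07:11', '2020-01-01 07:12']),
    'id': [4, 4, 4, 5],
    'name': ['four', 'four', 'four', 'five'],
    'a': [1.0, np.nan, np.nan, np.nan],
    'b': [np.nan, 3.0, np.nan, np.nan],
    'c': [np.nan, np.nan, np.nan, 2.0],
})

# Bucket timestamps into hours via Grouper, then group by id and name as well.
hourly = (df.groupby([pd.Grouper(key='dateTimeGmt', freq='h'), 'id', 'name'])[['a', 'b', 'c']]
            .sum()
            .reset_index())
print(hourly)

# The bucket column is still datetime-typed, so .dt accessors keep working.
print(hourly['dateTimeGmt'].dt.hour.tolist())
```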