I'm trying to find a way to reshape my pandas DataFrame. I have a dataset of news headlines, with multiple headlines per day. I would like one row per date (day), with each of that day's headlines in its own column. In other words, I want to combine all the headline data for each date instead of having a separate row for each headline. Some sort of custom pandas aggregator could do the job, but I'm struggling to come up with one.
I was able to group the data by date, but now all the headlines for a day end up in the same column rather than in separate columns (see picture 2).
df_nyt_all.groupby(['date'], as_index = False).agg({'headline': ','.join})
I have been looking for a solution for a while now but without any luck.
I attached 3 pictures. The first picture shows what my df looked like originally.


The third picture shows an example of how I would like the df to look.

Using a small dataframe as an example:
df = pd.DataFrame({'Date':['d1','d1','d1','d2','d2'],'headline':['h1','h2','h3','h4','h5']})
we can refine your own attempt as follows:
df.groupby(['Date'], as_index = True).agg({'headline': ','.join})['headline'].str.split(',', expand=True)
which splits the headlines you joined by a comma into separate columns:
       0   1     2
Date
d1    h1  h2    h3
d2    h4  h5  None
This is not very robust: if a headline itself contains a comma, it will be split at that comma as well. A more robust variant aggregates first by collecting the headlines for each date into a list, and then expands the lists into columns:
df.groupby('Date', as_index=True)['headline'].apply(list).apply(pd.Series).reset_index()
(here I reset the index -- you can do the same in the first solution) to get
  Date   0   1    2
0   d1  h1  h2   h3
1   d2  h4  h5  NaN
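To make the robustness point concrete, here is a minimal sketch comparing the two approaches on a headline that contains a comma. The sample data is made up for illustration, and I build the wide frame with the pd.DataFrame constructor instead of .apply(pd.Series), which is my own substitution and should give the same layout:

```python
import pandas as pd

# Made-up sample data: the first headline contains a comma.
df = pd.DataFrame({
    'Date': ['d1', 'd1', 'd2'],
    'headline': ['Stocks fall, bonds rise', 'h2', 'h3'],
})

# Join/split approach: the comma inside the headline creates an extra column.
joined = (
    df.groupby('Date')['headline']
      .agg(','.join)
      .str.split(',', expand=True)
)

# List-based approach: each headline stays in one piece.
lists = df.groupby('Date')['headline'].agg(list)
listed = pd.DataFrame(lists.tolist(), index=lists.index)

print(joined.shape)  # d1 is split into three pieces instead of two
print(listed.shape)
print(listed.iloc[0, 0])
```

With the join/split version, d1 ends up with three columns even though it only has two headlines. The list-based version keeps the headline with the comma intact.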
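To get the column names, the solution can be extended as below. We create a dict that maps each integer column label n to f'Top{n}' and pass it to rename. Keys that match no column are ignored, and the 'Date' column is left untouched because it is not a key in the dict: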
df2 = df.groupby('Date', as_index=True)['headline'].apply(list).apply(pd.Series).reset_index()
new_col_names = {n:f'Top{n}' for n in range(len(df2.columns))}
df2.rename(columns = new_col_names, inplace = True)
df2
produces
  Date Top0 Top1 Top2
0   d1   h1   h2   h3
1   d2   h4   h5  NaN
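As an alternative to building lists, the same table can probably be produced by numbering the headlines within each date and pivoting. This is only a sketch of a different technique, not the approach used above. The helper column name 'pos' is my own choice, and the 'Top' prefix just mirrors the naming used above:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['d1', 'd1', 'd1', 'd2', 'd2'],
    'headline': ['h1', 'h2', 'h3', 'h4', 'h5'],
})

wide = (
    # Number the headlines within each date: 0, 1, 2, ...
    df.assign(pos=df.groupby('Date').cumcount())
      # One row per date, one column per position.
      .pivot(index='Date', columns='pos', values='headline')
      # Turn the integer column labels 0, 1, 2 into Top0, Top1, Top2.
      .add_prefix('Top')
      .reset_index()
)
# Drop the leftover 'pos' label on the column axis.
wide.columns.name = None

print(wide)
```

Dates with fewer headlines than the busiest date get NaN in the unused columns, as in the outputs above.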