 

Sort pandas by Date, custom aggregator: combine all the data for each date

I'm trying to find a way to reshape my pandas dataframe. I have a dataset of news headlines, with multiple headlines per day. I would like one row per date (day), with each headline for that day in its own column. In other words, I want to combine all the headline data for each date instead of having a separate row for each headline. Some sort of custom pandas aggregator could do the job, but I'm struggling to come up with one.

I was able to group the data by date, but now all the headlines for a day end up in the same column instead of in separate columns (see picture 2).

df_nyt_all.groupby(['date'], as_index = False).agg({'headline': ','.join})

I have been looking for a solution for a while now but without any luck.

I attached 3 pictures. The first picture shows what my df looked like originally.

[image: current df]

The third picture shows an example of how I would like the df to look.

[image: how the df should look]

asked Dec 05 '25 18:12 by liaison


1 Answer

Using a small dataframe as an example:

import pandas as pd

df = pd.DataFrame({'Date':['d1','d1','d1','d2','d2'],'headline':['h1','h2','h3','h4','h5']})

we can refine your own attempt like this:

df.groupby(['Date'], as_index = True).agg({'headline': ','.join})['headline'].str.split(',', expand=True)

which splits the headlines you joined by a comma into separate columns:

       0   1     2
Date
d1    h1  h2    h3
d2    h4  h5  None

This is not very robust: if a headline itself contains a comma, it will be split at that comma as well. A more robust variant first collects the headlines for each date into a list and then expands each list into columns:

df.groupby('Date', as_index=True)['headline'].apply(list).apply(pd.Series).reset_index()

(Here I reset the index; you can do the same in the first solution.) This gives:

    Date    0   1   2
0   d1      h1  h2  h3
1   d2      h4  h5  NaN
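
To illustrate the robustness point, here is a minimal sketch comparing the two approaches. The sample data is made up for this illustration, and the first headline contains a comma on purpose; nothing here comes from the asker's dataset.

```python
import pandas as pd

# Made-up sample data; the first headline deliberately contains a comma.
df = pd.DataFrame({
    'Date': ['d1', 'd1', 'd2'],
    'headline': ['Stocks fall, bonds rise', 'h2', 'h3'],
})

# Join-then-split: the comma inside the headline produces a spurious extra column.
by_split = (
    df.groupby('Date')['headline']
      .agg(','.join)
      .str.split(',', expand=True)
)

# List-based: each headline stays intact, one per column.
by_list = df.groupby('Date')['headline'].apply(list).apply(pd.Series)

print(by_split.shape)  # (2, 3): d1 has only two headlines but fills three columns
print(by_list.shape)   # (2, 2)
```

With the join-and-split version, the d1 row is spread over three columns even though it has only two headlines, while the list-based version keeps each headline whole.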

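To name the columns, the solution can be extended as below. We build a dict that maps each integer column label n to f'Top{n}' and pass it to rename. The range runs one past the last integer label because df2 also has the Date column; this is harmless, since rename ignores keys that match no column.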

df2 = df.groupby('Date', as_index=True)['headline'].apply(list).apply(pd.Series).reset_index()
new_col_names = {n:f'Top{n}' for n in range(len(df2.columns))}
df2.rename(columns = new_col_names, inplace = True)
df2

produces

  Date Top0 Top1 Top2
0   d1   h1   h2   h3
1   d2   h4   h5  NaN
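
As a possible alternative (a different technique from the one above, not part of the original answer), the same layout can be sketched by numbering the headlines within each date with cumcount and then pivoting. The names pos and wide are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['d1', 'd1', 'd1', 'd2', 'd2'],
    'headline': ['h1', 'h2', 'h3', 'h4', 'h5'],
})

# Number the headlines within each date: 0, 1, 2, ...
pos = df.groupby('Date').cumcount()

# Pivot so that each position becomes its own column, then prefix the labels.
wide = (
    df.assign(pos=pos)
      .pivot(index='Date', columns='pos', values='headline')
      .add_prefix('Top')
)
wide.columns.name = None   # drop the leftover 'pos' label on the column axis
wide = wide.reset_index()  # turn Date back into a regular column

print(wide)
```

This avoids apply(pd.Series), which can be slow on large frames, and names the columns in the same step. Dates with fewer headlines get NaN in the unused columns, as in the list-based version.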
answered Dec 08 '25 07:12 by piterbarg


