Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Add column to DataFrame based on grouped values

Tags:

python

pandas

I have this DataFrame:

df = pd.DataFrame({'site': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
                   'day': [1, 1, 1, 1, 1, 1, 2, 2, 2],
                   'hour': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'clicks': [100, 200, 50, 0, 20, 30, 10, 0, 20]})
#   site  day  hour  clicks
# 0    a    1     1     100
# 1    a    1     2     200
# 2    a    1     3      50
# 3    b    1     1       0
# 4    b    1     2      20
# 5    b    1     3      30
# 6    a    2     1      10
# 7    a    2     2       0
# 8    a    2     3      20

What I want to achieve: Group these data by 'site' and 'day' and add a field (has_hour_1_clicks) that indicates if for that site/day the value of clicks at 'hour'==1 was more than 0. I understand that for the grouping I basically need this:

df.groupby(['site', 'day'])

But I can't find a way to get what I want with transform(), filter() or aggr().

So the DataFrame I want to get out of this would look like this:

#   site  day  hour  clicks  has_hour_1_clicks
# 0    a    1     1     100               True
# 1    a    1     2     200               True
# 2    a    1     3      50               True
# 3    b    1     1       0              False
# 4    b    1     2      20              False
# 5    b    1     3      30              False
# 6    a    2     1      10               True
# 7    a    2     2       0               True
# 8    a    2     3      20               True
like image 870
Ch2231 Avatar asked Nov 30 '25 08:11

Ch2231


1 Answers

This looks like a job for GroupBy.transform:

(df.eval('has_clicks = hour == 1 and clicks > 0')
   .groupby(['site', 'day'])['has_clicks']
   .transform('any'))                                     

0     True
1     True
2     True
3    False
4    False
5    False
6     True
7     True
8     True
Name: has_clicks, dtype: bool

How It Works

First, find rows that match your condition:

# eval statement is doing exactly this, but more succinctly
df.assign(has_clicks=(df['hour'] == 1) & (df['clicks'] > 0))

  site  day  hour  clicks  has_clicks
0    a    1     1     100        True
1    a    1     2     200       False
2    a    1     3      50       False
3    b    1     1       0       False
4    b    1     2      20       False
5    b    1     3      30       False
6    a    2     1      10        True
7    a    2     2       0       False
8    a    2     3      20       False

Next, take the "has_clicks" column, and transform all rows in that group to "True" if any of the rows in that group is true. This is what transform('any') is doing:

_.groupby(['site', 'day'])['has_clicks'].transform('any')

0     True
1     True
2     True
3    False
4    False
5    False
6     True
7     True
8     True
Name: has_clicks, dtype: bool
like image 70
cs95 Avatar answered Dec 01 '25 21:12

cs95