I have a csv file containing few attributes and one of them is the star ratings of different restaurants etoiles
(means star in french). Here annee
means the year when the rating was made.
note: I dont know how to share a Jupyter notebook table output in here, i tried different command lines but the format was always ugly. If someone could help with that.
And what i want to do is pretty simple (I think).. I want to add a new column that represents the standard deviation of the mean of stars per year of a restaurant. So I must estimate the average stars rating per year. Then, calculate the standard deviation on these values. But, I dont really know the syntax using pandas that will allow me to calculate the average star rating of a restaurant per year. Any suggestions?
I understand that I need to group the restaurants by year with .groupby('restaurant_id')['annee']
and then take the average of the stars rating of the restaurant during that year but i dont know how to write it.
# does not work
avis['newColumn'] = (
avis.groupby(['restaurant_id', 'annee'])['etoiles'].mean().std()
)
Here is a potential solution with groupby
:
#generating test data
dates = pd.date_range('20130101', periods=36, freq='M')
year = dates.strftime('%Y')
df = pd.DataFrame([np.random.randint(1,10) for x in range(36)],columns=['Rating'])
df['restaurants'] = ['R_{}'.format(i) for i in range(4)]*9
df['date'] = dates
df['year'] = year
print(df)
rating restaurants date year
0 8 R_0 2013-01-31 2013
1 7 R_1 2013-02-28 2013
2 1 R_2 2013-03-31 2013
3 6 R_3 2013-04-30 2013
4 4 R_0 2013-05-31 2013
5 8 R_1 2013-06-30 2013
6 7 R_2 2013-07-31 2013
7 5 R_3 2013-08-31 2013
8 4 R_0 2013-09-30 2013
9 5 R_1 2013-10-31 2013
10 4 R_2 2013-11-30 2013
11 8 R_3 2013-12-31 2013
12 9 R_0 2014-01-31 2014
13 6 R_1 2014-02-28 2014
14 3 R_2 2014-03-31 2014
15 6 R_3 2014-04-30 2014
16 2 R_0 2014-05-31 2014
17 8 R_1 2014-06-30 2014
18 1 R_2 2014-07-31 2014
19 5 R_3 2014-08-31 2014
20 1 R_0 2014-09-30 2014
21 7 R_1 2014-10-31 2014
22 3 R_2 2014-11-30 2014
23 4 R_3 2014-12-31 2014
24 2 R_0 2015-01-31 2015
25 4 R_1 2015-02-28 2015
26 8 R_2 2015-03-31 2015
27 7 R_3 2015-04-30 2015
28 3 R_0 2015-05-31 2015
29 1 R_1 2015-06-30 2015
30 2 R_2 2015-07-31 2015
31 8 R_3 2015-08-31 2015
32 7 R_0 2015-09-30 2015
33 5 R_1 2015-10-31 2015
34 3 R_2 2015-11-30 2015
35 3 R_3 2015-12-31 2015
#df['date'] = pd.to_datetime(df['date']) #more versatile
#df.set_index('dates') #more versatile
#df.groupby([pd.Grouper(freq='1Y'),'restraunts'])['Rating'].mean() #more versatile
df = df.groupby(['year','restaurants']).agg({'Rating':[np.mean,np.std]})
print(df)
Output:
Rating Rating
year restaurants mean std
2013 R_0 5.333333 2.309401
R_1 6.666667 1.527525
R_2 4.000000 3.000000
R_3 6.333333 1.527525
2014 R_0 4.000000 4.358899
R_1 7.000000 1.000000
R_2 2.333333 1.154701
R_3 5.000000 1.000000
2015 R_0 4.000000 2.645751
R_1 3.333333 2.081666
R_2 4.333333 3.214550
R_3 6.000000 2.645751
EDIT:
Renaming columns:
df.columns = ['Mean','STD']
df.reset_index(inplace=True)
year restaurant Mean STD
0 2013 R_0 1.333333 0.577350
1 2013 R_1 5.333333 3.511885
2 2013 R_2 1.333333 0.577350
3 2013 R_3 4.333333 2.886751
4 2014 R_0 3.000000 1.000000
5 2014 R_1 3.666667 2.886751
6 2014 R_2 4.333333 4.041452
7 2014 R_3 5.333333 2.081666
8 2015 R_0 6.000000 2.645751
9 2015 R_1 6.333333 3.785939
10 2015 R_2 6.333333 3.785939
11 2015 R_3 5.666667 3.055050
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With