I'm new to pandas/Numpy and I'm playing around to see how everything works.
I'm using this dataset for the top 1000 IMDb movie ratings: https://github.com/justmarkham/pandas-videos/blob/master/data/imdb_1000.csv
I'm trying to group by genre, filter by number of movies (> 100), and then display min/max/mean (as an integer)/median (as int)/count.
So far I have:
df.groupby("genre")['duration'].aggregate(['min', max, np.mean, np.median, 'count']).sort_values('median', ascending=False)
This shows all the genres and the duration statistics, but the mean and median are floats, and it includes those with a low count.
I want to somehow combine it with something like this:
df.groupby("genre")['duration'].filter(lambda x: x.count() > 100)
And
df.groupby("genre")['duration'].mean().astype(int)
Is this possible?
Next I'll want to graph it all, but that's for another day...
Thanks!
EDIT 1
For clarification, currently I get:
genre min max mean median count
Western 85 175 136.666667 135.0 9
Adventure 89 224 134.840000 127.0 75
Biography 85 202 131.844156 127.0 77
Action 80 205 126.485294 125.0 136
Drama 64 242 126.539568 123.0 278
Crime 67 229 122.298387 118.0 124
Thriller 107 120 114.200000 116.0 5
Mystery 69 160 115.625000 115.0 16
Sci-Fi 91 132 109.000000 113.0 5
Fantasy 112 112 112.000000 112.0 1
Family 100 115 107.500000 107.5 2
Comedy 68 187 107.602564 104.0 156
Horror 70 146 102.517241 104.0 29
Animation 75 134 96.596774 94.5 62
Film-Noir 88 111 97.333333 93.0 3
History 66 66 66.000000 66.0 1
But I want:
genre min max mean median count
Action 80 205 126 125 136
Drama 64 242 127 123 278
Crime 67 229 122 118 124
Comedy 68 187 108 104 156
Yes, you can simply chain the filters and groupbys:
df.groupby('genre').filter(
lambda x: len(x) > 100
).groupby('genre')['duration'].aggregate(
['min','max','mean','median','count']
).sort_values('median', ascending=False)
This yields as result:
>>> df.groupby('genre').filter(lambda x: len(x) > 100).groupby('genre')['duration'].aggregate(['min','max','mean','median','count']).sort_values('median', ascending=False)
min max mean median count
genre
Action 80 205 126.485294 125 136
Drama 64 242 126.539568 123 278
Crime 67 229 122.298387 118 124
Comedy 68 187 107.602564 104 156
you can convert this to integers as well:
>>> df.groupby('genre').filter(lambda x: len(x) > 100).groupby('genre')['duration'].aggregate(['min','max','mean','median','count']).sort_values('median', ascending=False).astype(int)
min max mean median count
genre
Action 80 205 126 125 136
Drama 64 242 126 123 278
Crime 67 229 122 118 124
Comedy 68 187 107 104 156
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With