Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to group and rank grouped records in a julia DataFrame

In julia I want to generate a rank within a group given by one column in the dataframe and the ranking based on a second column. I want to add the rank and number of rows in each group into new columns in the dataframe. The following appears to do what I require when grouping by :id and ranking by :b (is this correct?), but I have been unable to give the rank column a name. Is this sufficiently performant, or should I simply do it in a loop, after sorting on :id and :b? In my actual data/application I will have millions of rows and several hundred thousand unique groups.

 x = DataFrame(id=["a","b","a","c","c","a","c","a","a","c"], 
                b=[2,5,7,8,3,9,1,10,4,6], 
                c=["one","two","three","four","five","six","seven","eight","nine","ten"])
 y = combine(groupby(x,:id), sdf -> sort(sdf,:b), s->1:nrow(s), nrow => :n)

this outputs, which is correct, except that I wish to give 'x1' a column name like 'rank':

 Row │ id      b      c       x1     n     
     │ String  Int64  String  Int64  Int64 
─────┼─────────────────────────────────────
   1 │ a           2  one         1      5
   2 │ a           4  nine        2      5
   3 │ a           7  three       3      5
   4 │ a           9  six         4      5
   5 │ a          10  eight       5      5
   6 │ b           5  two         1      1
   7 │ c           1  seven       1      4
   8 │ c           3  five        2      4
   9 │ c           6  ten         3      4
  10 │ c           8  four        4      4
like image 991
Kobus Herbst Avatar asked Dec 07 '25 11:12

Kobus Herbst


1 Answers

This is the way to give rank column a name:

julia> combine(groupby(x,:id), sdf -> sort(sdf,:b), s->(rank=1:nrow(s),), nrow => :n)
10×5 DataFrame
 Row │ id      b      c       rank   n
     │ String  Int64  String  Int64  Int64
─────┼─────────────────────────────────────
   1 │ a           2  one         1      5
   2 │ a           4  nine        2      5
   3 │ a           7  three       3      5
   4 │ a           9  six         4      5
   5 │ a          10  eight       5      5
   6 │ b           5  two         1      1
   7 │ c           1  seven       1      4
   8 │ c           3  five        2      4
   9 │ c           6  ten         3      4
  10 │ c           8  four        4      4

or (which seems a bit cleaner for me)

julia> combine(groupby(x,:id), sdf -> sort(sdf,:b), :id => eachindex => :rank, nrow => :n)
10×5 DataFrame
 Row │ id      b      c       rank   n
     │ String  Int64  String  Int64  Int64
─────┼─────────────────────────────────────
   1 │ a           2  one         1      5
   2 │ a           4  nine        2      5
   3 │ a           7  three       3      5
   4 │ a           9  six         4      5
   5 │ a          10  eight       5      5
   6 │ b           5  two         1      1
   7 │ c           1  seven       1      4
   8 │ c           3  five        2      4
   9 │ c           6  ten         3      4
  10 │ c           8  four        4      4

(there is a PR open to add an option to add row-number more easily --- please comment there if you would find this useful)

Regarding your question about performance the following might be faster; the trade-off is between sorting a big data frame once vs small data frames many times:

transform!(groupby(sort(x, [:id, :b]), :id), :id => eachindex => :rank, nrow => :n)

It would be interesting to get a feedback on the comparison when you run this on your full data set.

like image 186
Bogumił Kamiński Avatar answered Dec 09 '25 17:12

Bogumił Kamiński