I am using the rank function in Pig 0.11.0 to generate ranks for my data, but I need the ranking done in a particular way: the rank should reset and start from 1 for every new id.
Is it possible to do this with the rank function directly? Any tips would be appreciated.
Data:
id,rating
X001, 9
X001, 9
X001, 8
X002, 9
X002, 7
X002, 6
X002, 5
X003, 8
X004, 8
X004, 7
X004, 7
X004, 4
Using the rank function like this: op = rank data by id,score;
I get this output:
rank,id,rating
1, X001, 9
1, X001, 9
2, X001, 8
3, X002, 9
4, X002, 7
5, X002, 6
6, X002, 5
7, X003, 8
8, X004, 8
9, X004, 7
9, X004, 7
10, X004, 4
Desired output:
rank,id,rating
1, X001, 9
1, X001, 9
2, X001, 8
1, X002, 9
2, X002, 7
3, X002, 6
4, X002, 5
1, X003, 8
1, X004, 8
2, X004, 7
2, X004, 7
3, X004, 4
You can group your data by id, then use the Enumerate UDF from DataFu to append an index to each tuple in each group's bag:
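register datafu-1.1.0.jar;
-- Enumerate('1') starts the appended index at 1
define Enumerate datafu.pig.bags.Enumerate('1');
data = load 'data' using PigStorage(',') as (id:chararray, rating:int);
-- one bag of (id, rating) tuples per id
data = group data by id;
-- sort each bag by rating, highest first
data = foreach data {
  sorted = order data by rating DESC;
  generate group, sorted;
}
-- append the index to every tuple, then flatten to (id, rating, index)
data = foreach data generate FLATTEN(Enumerate(sorted));
-- reorder the fields to (index, id, rating)
data = foreach data generate $2, $0, $1;
dump data;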
The DataFu jar file can be downloaded from the Maven Central repository: http://search.maven.org/#search|ga|1|g%3A%22com.linkedin.datafu%22
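One thing to watch for: Enumerate appends a consecutive index to each tuple, so tied ratings within an id (for example the two X001 rows with rating 9) get different numbers instead of sharing a rank as in the desired output. A possible way around this, offered only as an untested sketch, is to number the distinct (id, rating) pairs and join that number back onto the original rows. The aliases uniq, grouped, sorted_bags, ranked, joined and result are made up for illustration, and the positional references assume Enumerate returns each tuple's original fields followed by the index.

```pig
register datafu-1.1.0.jar;
define Enumerate datafu.pig.bags.Enumerate('1');

data = load 'data' using PigStorage(',') as (id:chararray, rating:int);

-- keep one row per distinct (id, rating) pair so ties collapse before numbering
uniq = distinct data;

-- number the distinct ratings within each id, highest rating first
grouped = group uniq by id;
sorted_bags = foreach grouped {
  sorted = order uniq by rating DESC;
  generate sorted;
}
-- assumed layout after flatten: $0 = id, $1 = rating, $2 = index
ranked = foreach sorted_bags generate FLATTEN(Enumerate(sorted));

-- join the per-pair number back onto the original rows, duplicates included
joined = join data by (id, rating), ranked by ($0, $1);
-- joined layout: $0 = id, $1 = rating (from data), $2..$4 = fields of ranked
result = foreach joined generate $4, $0, $1;
-- sort by id, then by rank, for readable output
result = order result by $1, $0;
dump result;
```

With the sample data, both (X001, 9) rows should match the same distinct pair and so receive the same number, while (X001, 8) receives the next one. The extra distinct and join add cost, so the simpler version above may be preferable when ties do not matter.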
You can use the RANK function as below: B = rank A by rating DESC; dump B;
Note: this assumes A has the (id, rating) schema from your example.