
How to sort with multiple fields in MapReduce Python Streaming?

I'm having a problem with sorting while using MapReduce with streaming and Python.

This is part of a bigger problem, but it can be reduced (no pun intended :) ) to this:

>> cat inputFile.txt
a       b       1       file1
a       b       2       file1
e       f       0       file2
d       c       3       file3
d       e       2       file4
a       c       5       file5
a       b       3       file1
d       c       2       file3
e       f       2       file2
a       c       4       file5
d       e       10      file4

The first and second columns are the keys.

I'd like the output of the map phase to be sorted this way (first by column 1, then by column 2, and then by column 3 numerically):

>> sort -k1,1 -k2,2 -k3n,3 inputFile.txt
a       b       1       file1
a       b       2       file1
a       b       3       file1
a       c       4       file5
a       c       5       file5
d       c       2       file3
d       c       3       file3
d       e       2       file4
d       e       10      file4
e       f       0       file2
e       f       2       file2
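
For reference, the same ordering can be reproduced locally in Python, which may help when comparing Hadoop's output with the expected result. This is only a minimal sketch: `sort_records` is a hypothetical helper name, and it assumes whitespace-separated fields with an integer in column 3.

```python
def sort_records(lines):
    """Order records by column 1, then column 2, then column 3 as an integer."""
    def sort_key(line):
        fields = line.split()
        # Columns 1 and 2 compare as text; column 3 compares as a number,
        # so "10" sorts after "2" (as with sort's -k3n,3).
        return (fields[0], fields[1], int(fields[2]))
    return sorted(lines, key=sort_key)


if __name__ == "__main__":
    # A few records taken from the sample input above.
    sample = [
        "d\te\t10\tfile4",
        "a\tb\t2\tfile1",
        "d\te\t2\tfile4",
        "a\tb\t1\tfile1",
    ]
    for record in sort_records(sample):
        print(record)
```

Converting column 3 with `int` is what keeps `10` after `2`; a plain text comparison would put `10` first.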

The fourth column here is a hint on how I'd like the records to be split into files for the reduce step, but it's OK if two keys end up in the same file (as long as all instances of each key are in a single file). To achieve this I run the following command:

    hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.comparator.options="-k3,3" \
    -D mapred.text.key.partitioner.options="-k3,3" \
    -mapper cat -reducer cat \
    -input /user/hadoop/inputFile.txt \
    -output /user/hadoop/output

The output of this command is not sorted. For example:

>> cat output/part-00066
a       b       2       file1
a       b       3       file1
a       b       1       file1

Remarks:

  • I know that in the above command I used "-k3,3" and not "-k3n,3"; I just wanted to see first whether any sort works at all.
  • I tried using "-k1,1,-k2,2 -k3n,3", but I got the same result.
  • I tried using 3 for the number of key fields, and it yielded a result where records with the same first two columns end up in separate files.

It feels like I'm missing something really basic. What am I doing wrong here?

Thanks a lot for your help!

elkon asked May 03 '26 13:05


1 Answer

After trying almost every possible combination, I've found that this works:

    hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D stream.num.map.output.key.fields=4 \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -D mapred.text.key.comparator.options="-k1,1 -k2,2 -k3n,3" \
    -input /user/hadoop/inputFile.txt \
    -output /user/hadoop/output \
    -mapper cat -reducer cat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Further explanation could be found here:

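The key (again, no pun intended :) ) is the use of the KeyFieldBasedPartitioner as the partitioner. With all four fields treated as the map output key, -k1,2 tells it to partition on the first two fields only, so every record with the same first two columns reaches the same reducer, while the comparator options control the sort order within each reducer's input.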
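
The effect of partitioning on the first two fields and sorting on all three can be imitated locally. The sketch below is an illustration under stated assumptions, not Hadoop's implementation: `partition_index` and `simulate_shuffle` are hypothetical names, and `zlib.crc32` is only a deterministic stand-in for whatever hash the real partitioner uses.

```python
import zlib
from collections import defaultdict


def partition_index(fields, num_partition_fields, num_reducers):
    """Choose a simulated reducer from the leading key fields only.

    zlib.crc32 is a stand-in hash (an assumption), not Hadoop's own.
    """
    partition_key = "\t".join(fields[:num_partition_fields])
    return zlib.crc32(partition_key.encode("utf-8")) % num_reducers


def simulate_shuffle(lines, num_reducers=3, num_partition_fields=2):
    """Group records per simulated reducer, then sort each group."""
    partitions = defaultdict(list)
    for line in lines:
        fields = line.split()
        index = partition_index(fields, num_partition_fields, num_reducers)
        partitions[index].append(fields)
    # Within a partition: columns 1 and 2 as text, column 3 as a number.
    return {
        index: sorted(records, key=lambda f: (f[0], f[1], int(f[2])))
        for index, records in partitions.items()
    }


if __name__ == "__main__":
    sample = [
        "a b 2 file1",
        "d e 10 file4",
        "a b 1 file1",
        "d e 2 file4",
        "a c 5 file5",
    ]
    for index, records in sorted(simulate_shuffle(sample).items()):
        print("partition", index)
        for fields in records:
            print("  " + "\t".join(fields))
```

Because the partition is chosen from columns 1 and 2 alone, all records sharing those columns land in one group, and the numeric sort on column 3 happens inside that group. Passing `num_partition_fields=3` instead lets records with the same first two columns scatter across groups, which matches the behavior described in the question's third remark.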

elkon answered May 06 '26 03:05


