I'm having a problem with sorting while using MapReduce with streaming and Python.
This is part of a bigger problem, but it can be reduced (no pun intended :) ) to this:
>> cat inputFile.txt
a b 1 file1
a b 2 file1
e f 0 file2
d c 3 file3
d e 2 file4
a c 5 file5
a b 3 file1
d c 2 file3
e f 2 file2
a c 4 file5
d e 10 file4
The first and second columns are the keys.
I'd like the output of the map phase to be sorted this way (first by column 1, then by column 2, and then by column 3 numerically):
>>sort -k1,1 -k2,2 -k3n,3 inputFile.txt
a b 1 file1
a b 2 file1
a b 3 file1
a c 4 file5
a c 5 file5
d c 2 file3
d c 3 file3
d e 2 file4
d e 10 file4
e f 0 file2
e f 2 file2
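For anyone who wants to check the target ordering without GNU sort, here is a minimal Python sketch of the same three-level sort. It assumes whitespace-separated columns and an integer third column; the `sort_key` name is only illustrative.

```python
def sort_key(line):
    # Columns: key part 1, key part 2, numeric value, file hint.
    col1, col2, col3, _hint = line.split()
    # Compare the first two columns as text and the third as an integer,
    # mirroring sort -k1,1 -k2,2 -k3n,3.
    return (col1, col2, int(col3))

lines = [
    "a b 1 file1",
    "a b 2 file1",
    "e f 0 file2",
    "d c 3 file3",
    "d e 2 file4",
    "a c 5 file5",
    "a b 3 file1",
    "d c 2 file3",
    "e f 2 file2",
    "a c 4 file5",
    "d e 10 file4",
]

ordered = sorted(lines, key=sort_key)
for line in ordered:
    print(line)
```

The numeric comparison on column 3 is what puts "d e 2" before "d e 10"; a plain text comparison would order them the other way round.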
The fourth column here is a hint on how I'd like the files to be for the reduce step, but it's OK if two keys are in the same file (as long as all instances of each key are in a single file). To achieve this I run the following command:
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D stream.num.map.output.key.fields=2 -D mapred.text.key.comparator.options="-k3,3" -D mapred.text.key.partitioner.options="-k3,3" -mapper cat -reducer cat -input /user/hadoop/inputFile.txt -output /user/hadoop/output
The output of this command is not sorted. For example:
>>cat output/part-00066
a b 2 file1
a b 3 file1
a b 1 file1
Remarks:
It seems like I'm missing something really basic. What am I doing wrong here?
Thanks a lot for your help!
After trying almost every possible combination, I've found that this works:
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D stream.num.map.output.key.fields=4 \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.text.key.comparator.options="-k1,1 -k2,2 -k3n,3" \
-input /user/hadoop/inputFile.txt \
-output /user/hadoop/output \
-mapper cat -reducer cat \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Further explanation could be found here:
The key (again, no pun intended :) ) is the use of the KeyFieldBasedPartitioner as the partitioner.
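To make the division of labour concrete, here is a rough Python simulation of what the command above asks for: partition on the first two fields only (the role of -k1,2), then sort each partition on all three fields. This is a sketch under assumptions, not Hadoop's implementation: `NUM_REDUCERS`, `partition_for` and `simulate_shuffle` are hypothetical names, and `zlib.crc32` is just a stand-in for Hadoop's own hash function.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical reducer count, chosen only for illustration

def partition_for(line, num_reducers=NUM_REDUCERS):
    # Use only the first two fields to pick a partition, so every record
    # with the same (col1, col2) pair goes to the same reducer.
    # crc32 is a stand-in hash, not the one Hadoop uses.
    fields = line.split()
    partition_key = " ".join(fields[:2])
    return zlib.crc32(partition_key.encode("utf-8")) % num_reducers

def sort_key(line):
    # Sort on col1, col2 as text and col3 as a number.
    fields = line.split()
    return (fields[0], fields[1], int(fields[2]))

def simulate_shuffle(lines, num_reducers=NUM_REDUCERS):
    # Group records by partition, then sort inside each partition.
    partitions = defaultdict(list)
    for line in lines:
        partitions[partition_for(line, num_reducers)].append(line)
    return {p: sorted(rows, key=sort_key) for p, rows in partitions.items()}

lines = [
    "a b 1 file1", "a b 2 file1", "e f 0 file2", "d c 3 file3",
    "d e 2 file4", "a c 5 file5", "a b 3 file1", "d c 2 file3",
    "e f 2 file2", "a c 4 file5", "d e 10 file4",
]

result = simulate_shuffle(lines)
for partition in sorted(result):
    print("partition", partition)
    for row in result[partition]:
        print("  " + row)
```

Which keys share a partition depends on the hash, so the grouping printed here will not match the real part files. The property that matters is the same, though: every (col1, col2) pair lands in exactly one partition, and each partition is sorted numerically on the third column.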