How to sort with multiple fields in MapReduce Python Streaming?

Question

I'm having a problem with sorting while using MapReduce with streaming and Python.

This is part of a bigger problem, but it can be reduced (no pun intended :) ) to this:

>> cat inputFile.txt
a       b       1       file1
a       b       2       file1
e       f       0       file2
d       c       3       file3
d       e       2       file4
a       c       5       file5
a       b       3       file1
d       c       2       file3
e       f       2       file2
a       c       4       file5
d       e       10      file4

The first and second columns are the keys.

I'd like the output of the map phase to be sorted this way (first by column 1, then by column 2, and then numerically by column 3):

>> sort -k1,1 -k2,2 -k3n,3 inputFile.txt
a       b       1       file1
a       b       2       file1
a       b       3       file1
a       c       4       file5
a       c       5       file5
d       c       2       file3
d       c       3       file3
d       e       2       file4
d       e       10      file4
e       f       0       file2
e       f       2       file2
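
For readers who think in Python rather than in `sort` flags, the ordering above can be sketched roughly as follows. This is only an illustration of the desired order, not part of the Hadoop job; the function names `sort_key` and `sort_lines` are made up for this example, and it assumes whitespace-separated fields with an integer in column 3. Python compares strings by code point, which matches `sort` for plain ASCII keys like these but may differ under other locales.

```python
def sort_key(line):
    """Build a key mirroring `sort -k1,1 -k2,2 -k3n,3`.

    Columns 1 and 2 are compared as text, column 3 as an integer.
    (Hypothetical helper, assumes at least three whitespace-separated fields.)
    """
    fields = line.split()
    return (fields[0], fields[1], int(fields[2]))


def sort_lines(lines):
    """Return the lines ordered by column 1, then column 2, then column 3 numerically."""
    return sorted(lines, key=sort_key)


if __name__ == "__main__":
    sample = [
        "d\te\t10\tfile4",
        "a\tb\t2\tfile1",
        "d\te\t2\tfile4",
        "a\tb\t1\tfile1",
    ]
    for row in sort_lines(sample):
        print(row)
```

The point of converting column 3 with `int` is the same as the `n` in `-k3n,3`: without it, "10" would sort before "2".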

The fourth column here is a hint on how I'd like the records to be split into files for the reduce step, but it's OK if two keys are in the same file (as long as all instances of each key are in a single file). To achieve this I run the following command:

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D stream.num.map.output.key.fields=2 -D mapred.text.key.comparator.options="-k3,3" -D mapred.text.key.partitioner.options="-k3,3" -mapper cat -reducer cat -input /user/hadoop/inputFile.txt -output /user/hadoop/output

The output of this command is not sorted. For example:

>> cat output/part-00066
a       b       2       file1
a       b       3       file1
a       b       1       file1

Remarks:

I know that the command above uses "-k3,3" and not "-k3n,3", but I just wanted to see whether any sort works at all first.
I tried using "-k1,1,-k2,2 -k3n,3" but got the same result.
I tried using 3 for the number of key fields, and it yielded a result where the keys are in separate files.

It feels like I'm missing something really basic. What am I doing wrong here?

Thanks a lot for your help!

elkon · Accepted Answer

After trying almost every possible combination, I've found that this works:

    hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D stream.num.map.output.key.fields=4 \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -D mapred.text.key.comparator.options="-k1,1 -k2,2 -k3n,3" \
    -input /user/hadoop/inputFile.txt \
    -output /user/hadoop/output \
    -mapper cat -reducer cat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Further explanation could be found here:

The key (again, no pun intended :) ) is the use of the KeyFieldBasedPartitioner as the partitioner.
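
To make the effect of these settings easier to picture, here is a small local simulation in Python. It is only a sketch of the idea, not how Hadoop is implemented: the names `partition_of` and `simulate_shuffle` are invented for this example, and `zlib.crc32` is a stand-in for whatever hash the real partitioner uses. The assumption it illustrates is that records are assigned to a reducer using fields 1 and 2 only (as with `-k1,2`), while the sort inside each reducer's input uses fields 1 and 2 as text and field 3 as a number (as with `-k1,1 -k2,2 -k3n,3`). As far as I understand it, this is also why the number of key fields has to cover field 3: the comparator can only sort on fields that are part of the key.

```python
import zlib
from collections import defaultdict


def partition_of(line, num_reducers):
    """Choose a partition from the first two fields only (the idea behind -k1,2).

    zlib.crc32 is a stand-in hash for illustration; Hadoop's own hash differs.
    """
    fields = line.split()
    partition_key = (fields[0] + "\t" + fields[1]).encode("utf-8")
    return zlib.crc32(partition_key) % num_reducers


def comparator_key(line):
    """Sort key: fields 1 and 2 as text, field 3 as an integer (-k1,1 -k2,2 -k3n,3)."""
    fields = line.split()
    return (fields[0], fields[1], int(fields[2]))


def simulate_shuffle(lines, num_reducers):
    """Group lines by partition, then sort each partition independently."""
    partitions = defaultdict(list)
    for line in lines:
        partitions[partition_of(line, num_reducers)].append(line)
    for rows in partitions.values():
        rows.sort(key=comparator_key)
    return dict(partitions)


if __name__ == "__main__":
    records = [
        "a\tb\t2\tfile1",
        "a\tb\t3\tfile1",
        "d\te\t10\tfile4",
        "a\tb\t1\tfile1",
        "d\te\t2\tfile4",
    ]
    for part, rows in sorted(simulate_shuffle(records, 3).items()):
        print("partition", part)
        for row in rows:
            print("  " + row)
```

Which partition number a given key lands in is arbitrary here; what matters is that every record sharing the first two fields ends up in the same partition and that each partition is sorted on all three columns.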

Tags: python, mapreduce
