
Does it make sense to create more reducers than nodes we have?

The main thing for a good reduce phase is a good partition distribution. But suppose we can't control it, or don't know how to (because we don't know our data).

Will a large number of reducers increase the chances of a more even distribution of data per reducer? What is the common practice here?

yura asked Dec 05 '25 10:12



1 Answer

Data is usually distributed among reducers using modulus hash partitioning, which spreads records evenly as long as the keys themselves are well spread. That means (effectively) that the hash of the key is divided by the number of reducers, and the remainder is the index of the reducer that the record gets sent to. For example, if the hash of your key is 47269893425623, and you have 10 reducers, 47269893425623 % 10 = 3, so the 4th reducer (remember, 0-indexed) gets that record.
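As a rough illustration of the arithmetic above, here is a minimal Python sketch of modulus hash partitioning. The function name `reducer_index` is made up for this example and is not part of any framework's API; a real partitioner would also compute the hash from the key itself.

```python
def reducer_index(key_hash: int, num_reducers: int) -> int:
    """Return the 0-based index of the reducer that receives a key.

    Hypothetical helper: takes an already-computed integer hash and
    applies modulus partitioning.
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")
    # In Python, % with a positive divisor never returns a negative value,
    # so negative hashes still land in the range [0, num_reducers).
    return key_hash % num_reducers


# The example from the text: index 3 is the 4th reducer.
print(reducer_index(47269893425623, 10))  # 3
```

Every record with the same key hash lands on the same index, which is what lets a reducer see all values for a key together.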

If your records have hot-spot keys, meaning that a large percentage of the values have exactly the same key, then adding reducers will probably not help. You'll just be adding overhead, because all of the records with a hot key will still go to the same reducer.
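A small simulation can make the hot-spot case concrete. This is only a sketch under assumptions: the data set is invented (900 records sharing one key, 100 records with distinct keys), `zlib.crc32` stands in for whatever hash your framework uses, and `max_reducer_load` is a hypothetical helper.

```python
import zlib
from collections import Counter


def stable_hash(key: str) -> int:
    # crc32 is used only because it is deterministic across runs;
    # it is a stand-in, not the hash any particular framework uses.
    return zlib.crc32(key.encode("utf-8"))


def max_reducer_load(keys, num_reducers: int) -> int:
    """Return the record count on the busiest reducer."""
    loads = Counter(stable_hash(k) % num_reducers for k in keys)
    return max(loads.values())


# Invented data: one hot key dominates the input.
skewed_keys = ["hot"] * 900 + [f"key-{i}" for i in range(100)]
# Same volume, but every key is distinct.
uniform_keys = [f"key-{i}" for i in range(1000)]

for n in (4, 16, 64):
    print(n, max_reducer_load(skewed_keys, n), max_reducer_load(uniform_keys, n))
```

With the skewed input, the busiest reducer never handles fewer than 900 records no matter how many reducers are added, while the busiest reducer for the distinct-key input should shrink as reducers are added.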

If you do not have that situation, then adding reducers may help. Just remember that there is a network copy stage between the mappers and the reducers. The more reducers you split the work across, the more copying needs to be done between the mappers and reducers, so that part of the job will get slower.
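To get a feel for how the copy stage grows, here is a back-of-the-envelope sketch. It assumes the usual shuffle model in which every reducer fetches one partition from every mapper; the function names and the numbers are illustrative only, and real jobs differ (empty partitions, compression, and so on).

```python
def shuffle_fetches(num_mappers: int, num_reducers: int) -> int:
    """Number of mapper-to-reducer copies, assuming each reducer
    pulls one partition from each mapper (hypothetical helper)."""
    return num_mappers * num_reducers


def avg_fetch_bytes(total_map_output_bytes: int,
                    num_mappers: int,
                    num_reducers: int) -> float:
    """Average size of one copy if the output were spread evenly."""
    return total_map_output_bytes / shuffle_fetches(num_mappers, num_reducers)


# Illustrative numbers: 100 mappers producing 10 GiB of map output in total.
total = 10 * 1024**3
for reducers in (10, 100, 1000):
    print(reducers,
          shuffle_fetches(100, reducers),
          round(avg_fetch_bytes(total, 100, reducers)))
```

Under this model the same volume of data is moved as many more, much smaller copies, which is one way the extra reducers show up as overhead.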

Chris Shain answered Dec 08 '25 19:12
