So the main thing with a good reduce phase is a good partition distribution. But suppose we can't control it, or don't know how to, because we don't know our data.
Will a larger number of reducers increase the chances of a better per-reducer data distribution? What is the common practice here?
Data is usually evenly distributed among reducers using modulus hash partitioning. That means (effectively) that the hash of the key is divided by the number of reducers, and the remainder is the index of the reducer that the value gets sent to. For example, if the hash of your key is 47269893425623, and you have 10 reducers, 47269893425623 % 10 = 3, so the 4th reducer (remember, 0-indexed) gets that record.
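For concreteness, here is a minimal Java sketch of that computation. The helper mirrors the logic of Hadoop's default `HashPartitioner`; the class name and the sample hash value are just illustrations of the example above.

```java
// A minimal sketch of modulus hash partitioning. The helper mirrors the
// logic of Hadoop's default HashPartitioner; the class name and sample
// values are for illustration only.
public class ModulusPartitionDemo {

    // Returns the 0-indexed reducer a key is routed to. Masking with
    // Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partitionFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // The worked example from above: hash 47269893425623, 10 reducers.
        long hash = 47269893425623L;
        System.out.println(hash % 10); // prints 3 -> the 4th reducer (0-indexed)
    }
}
```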
If your records have hot-spot keys, meaning that a large percentage of the values share exactly the same key, then adding reducers will probably not help; you'll just be adding overhead, since all of those records will still go to the same reducer.
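A quick demonstration of that point, under the same modulus scheme as above (the key string and reducer counts are hypothetical): records sharing one key land on a single reducer no matter how many reducers you configure.

```java
// Demonstrates the hot-key problem: all records with the same key map to
// one reducer regardless of the reducer count. The key is hypothetical.
public class HotKeyDemo {

    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String hotKey = "user-12345"; // imagine 80% of records carry this key
        for (int reducers : new int[] {10, 20, 100}) {
            System.out.printf("%3d reducers -> hot key maps to reducer %d%n",
                    reducers, partitionFor(hotKey, reducers));
        }
        // Each configuration yields a single target reducer, so the skewed
        // share of the data never spreads out as reducers are added.
    }
}
```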
If you do not have that situation, then adding reducers may help. Just remember that there is a network copy (shuffle) stage between mapper and reducer: the more reducers you add, the more copying has to happen between mappers and reducers, so that part of the job gets slower.
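If you do want to experiment with the reducer count, it is set on the job before submission. A hedged sketch using the `org.apache.hadoop.mapreduce` API; the job name and the value 20 are arbitrary:

```java
// Setting the number of reduce tasks on a Hadoop job. The job name and
// the count of 20 are placeholders; try a few values and measure.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "reducer-count-demo");
        // ... mapper/reducer classes, input/output paths, etc. go here ...
        job.setNumReduceTasks(20); // experiment and watch shuffle time
    }
}
```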