I'm new to hadoop and mapreduce. Could someone clarify the difference between a combiner and an in-mapper combiner or are they the same thing?
You are probably already aware that a combiner is a process that runs locally on each Mapper machine to pre-aggregate data before it is shuffled across the network to the various cluster Reducers.
The in-mapper combiner takes this optimization a bit further: the aggregations do not even write to local disk: they occur in-memory in the Mapper itself.
The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods of
org.apache.hadoop.mapreduce.Mapper
to create an in-memory map along the following lines:
Map<LongWritable, Text> inmemMap = null
   protected void setup(Mapper.Context context) throws IOException, InterruptedException {
   inmemMap  = new Map<LongWritable, Text>();
 }
Then during each map() invocation you add values to than in memory map (instead of calling context.write() on each value. Finally the Map/Reduce framework will automatically call:
protected void cleanup(Mapper.Context context) throws IOException, InterruptedException {
  for (LongWritable key : inmemMap.keySet()) {
      Text myAggregatedText = doAggregation(inmemMap.get(key))// do some aggregation on 
                   the inmemMap.     
      context.write(key, myAggregatedText);
  }
}
Notice that instead of calling context.write() every time, you add entries to the in-memory map. Then in the cleanup() method you call context.write() but with the condensed/pre-aggregated results from your in-memory map . Therefore your local map output spool files (that will be read by the reducers) will be much smaller.
In both cases - both in memory and external combiner - you gain the benefits of less network traffic to the reducers due to smaller map spool files. That also decreases the reducer processing.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With