
Difference and relationship between slots, map tasks, data splits, Mapper

I have gone through a few Hadoop books and papers.

A slot is a map/reduce computation unit on a node; it may be a map slot or a reduce slot. As far as I know, a split is a group of blocks of a file in HDFS, with some length and the locations of the nodes where they are stored. Mapper is a class, but when the code is instantiated it is called a map task. Am I right? I am not clear on the difference and relationship between map tasks, data splits and Mapper.

Regarding scheduling, I understand that when a map slot on a node is free, a map task is chosen from the non-running map tasks and launched if the data to be processed by that map task is on the node. Can anyone explain this clearly in terms of the above concepts: slots, Mapper, map task, etc.?

Thanks, Arun

Arun K asked Oct 19 '25 12:10


1 Answer

As far as I know, a split is a group of blocks of a file in HDFS, with some length and the locations of the nodes where they are stored.

An InputSplit is the unit of data that a particular mapper will process. It need not be a group of HDFS blocks: it can be a single line, 100 rows from a database, a 50 MB file, etc.
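To make the "logical unit" point concrete, here is a rough sketch in plain Python (not the Hadoop API, which is Java). The names FileSplit and compute_splits are made up for illustration, and the rule of taking location hints from the block where the split starts is a simplifying assumption. It shows that the number of splits depends on the chosen split size, not on the number of HDFS blocks.

```python
from dataclasses import dataclass
from typing import List, Tuple

MB = 1024 * 1024


@dataclass(frozen=True)
class FileSplit:
    """Hypothetical stand-in for a file-based split: a byte range plus location hints."""
    path: str
    start: int               # byte offset where the split begins
    length: int              # number of bytes covered by the split
    hosts: Tuple[str, ...]   # nodes believed to hold the data (a scheduling hint only)


def compute_splits(path: str, file_size: int, split_size: int,
                   block_hosts: List[Tuple[str, ...]], block_size: int) -> List[FileSplit]:
    """Cut a file into logical splits of at most split_size bytes.

    block_hosts[i] lists the nodes storing block i of the file. As a
    simplification, each split takes its hosts from the block it starts in.
    """
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        hosts = block_hosts[offset // block_size]
        splits.append(FileSplit(path, offset, length, hosts))
        offset += length
    return splits


if __name__ == "__main__":
    # A 300 MB file stored as three 128 MB blocks (the last one partly filled).
    hosts = [("node1", "node2"), ("node2", "node3"), ("node1", "node3")]
    for size in (128 * MB, 64 * MB):
        result = compute_splits("/data/input.log", 300 * MB, size, hosts, 128 * MB)
        print(size // MB, "MB splits:", len(result))
```

With the same three blocks, a 128 MB split size gives three splits and a 64 MB split size gives five, so the number of map tasks changes even though the stored blocks do not.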

I am not clear on the difference and relationship between map tasks, data splits and Mapper.

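Each InputSplit is processed by exactly one map task. Mapper is the class that holds your map logic; each map task creates its own instance of that class and calls its map() method once for every record in the split.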
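The relationship between the three terms, plus the slot scheduling asked about in the question, can be sketched roughly like this. It is a toy model in plain Python, not Hadoop code: WordLengthMapper, MapTask and pick_task are hypothetical names, and the fallback to a non-local task is a simplification of the real scheduler, which prefers a node that holds the data but does not require it.

```python
class WordLengthMapper:
    """Hypothetical user-defined Mapper: the class only holds the map logic."""

    def map(self, record):
        return (record, len(record))


class MapTask:
    """One task per split; the task creates its own Mapper instance when it runs."""

    def __init__(self, task_id, split_records, hosts, mapper_class):
        self.task_id = task_id
        self.split_records = split_records  # the records of exactly one split
        self.hosts = hosts                  # nodes where the split's data lives
        self.mapper_class = mapper_class

    def run(self):
        mapper = self.mapper_class()        # a fresh Mapper instance for this task
        return [mapper.map(record) for record in self.split_records]


def pick_task(pending, node):
    """Choose a task for a free map slot on `node`.

    A data-local task is preferred; if none is pending, any task is taken.
    Returns None when nothing is left to run.
    """
    for task in pending:
        if node in task.hosts:
            pending.remove(task)
            return task
    return pending.pop(0) if pending else None


if __name__ == "__main__":
    pending = [
        MapTask(0, ["a", "bb"], ("node1",), WordLengthMapper),
        MapTask(1, ["ccc"], ("node2",), WordLengthMapper),
    ]
    # A map slot frees up on node2, so the task whose data is on node2 goes first.
    task = pick_task(pending, "node2")
    print(task.task_id, task.run())
```

So a slot is just capacity on a node, a split is the data, a map task is the unit of work bound to one split, and the Mapper is the code that task runs.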

Praveen Sripati answered Oct 22 '25 07:10