How does toLocalIterator works?

Question

I am trying to understand how toLocalIterator works, I read few threads and blog However I am not sure about one particular thing.

Does it copies all the partitions at once to the driver node and creates an iterator? Or does it copies data one partition at a time and then creates an iterator?

David · Accepted Answer

It will bring in one partition at at time. According to the documentation

Return an iterator that contains all of the elements in this RDD.

The iterator will consume as much memory as the largest partition in this RDD.

Note This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.

The iterator will "consume as much memory as the largest partition" can be understood to mean that only 1 (whole) partition will ever be on the driver node.

So, the first partition is sent to the driver. If you continue to iterate and reach the end of the first partition, the second partition will be sent to the driver node. If the parent RDD is not cached, this will also result in re-computation of data.

How does toLocalIterator works?

Tags:

apache-spark

hadoop

hadoop2

pyspark

Gaurang Shah

1 Answers

David

Recent Activity

Donate For Us

How does toLocalIterator works?

Tags:

apache-spark

hadoop

hadoop2

pyspark

Gaurang Shah

1 Answers

David

Related questions

Recent Activity

Donate For Us