Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does toLocalIterator works?

I am trying to understand how toLocalIterator works, I read few threads and blog However I am not sure about one particular thing.

Does it copies all the partitions at once to the driver node and creates an iterator? Or does it copies data one partition at a time and then creates an iterator?

like image 930
Gaurang Shah Avatar asked Oct 21 '25 19:10

Gaurang Shah


1 Answers

It will bring in one partition at at time. According to the documentation

Return an iterator that contains all of the elements in this RDD.

The iterator will consume as much memory as the largest partition in this RDD.

Note This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.

The iterator will "consume as much memory as the largest partition" can be understood to mean that only 1 (whole) partition will ever be on the driver node.

So, the first partition is sent to the driver. If you continue to iterate and reach the end of the first partition, the second partition will be sent to the driver node. If the parent RDD is not cached, this will also result in re-computation of data.

like image 55
David Avatar answered Oct 23 '25 07:10

David



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!