I am trying to understand how toLocalIterator works, I read few threads and blog However I am not sure about one particular thing.
Does it copies all the partitions at once to the driver node and creates an iterator? Or does it copies data one partition at a time and then creates an iterator?
It will bring in one partition at at time. According to the documentation
Return an iterator that contains all of the elements in this RDD.
The iterator will consume as much memory as the largest partition in this RDD.
Note This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.
The iterator will "consume as much memory as the largest partition" can be understood to mean that only 1 (whole) partition will ever be on the driver node.
So, the first partition is sent to the driver. If you continue to iterate and reach the end of the first partition, the second partition will be sent to the driver node. If the parent RDD is not cached, this will also result in re-computation of data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With