We know that if we need to convert RDD to a list, then we should use collect(). but this function puts a lot of stress on the driver (as it brings all the data from different executors to the driver) which causes performance degradation or worse (whole application may fail).
Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?
Basically in current scenario where we deal with huge amount of data in batch or stream data processing, APIs like collect() and collectAsMap() has become completely useless in a real project with real amount of data. We can use it in demo code, but that's all there to use for these APIs. So why to have an API which we can not even use (Or am I missing something).
Can there be a better way to achieve the same result through some other method or can we implement collect() and collectAsMap() in a more effective way other that just calling
List<String> myList= RDD.collect.toList (which effects performance)
I looked up to google but could not find anything which can be effective. Please help if someone has got a better approach.
Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?
No, and there can't be. And if there were such a way, collect would be implemented using it in the first place.
Well, technically you could implement List interface on top of RDD (or most of it?), but that would be a bad idea and quite pointless.
So why to have an API which we can not even use (Or am I missing something).
collect is intended to be used for cases where only large RDDs are inputs or intermediate results, and the output is small enough. If that's not your case, use foreach or other actions instead.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With