I have an RDD in which each entry belongs to a class. I want to split this single RDD into several RDDs, such that all entries of a class go into one RDD. Suppose I have 100 such classes in the input RDD; I want each class in its own RDD. I can do this with a filter for each class (as shown below), but that would launch several jobs. Is there a better way to do it in a single job?
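import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
def method(input: RDD[LabeledPoint], classes: List[Double]): List[RDD[LabeledPoint]] =
  classes.map { lbl => input.filter(_.label == lbl) }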
It's similar to another question, but I have more than 2 classes (around 10).
I was facing the same issue, and unfortunately, according to the different resources I found, there is no other way.
The thing is that your result is a list of RDDs built from a single RDD, and as far as I can tell Spark has no transformation that turns one RDD into several. If you look here, the answer also says it's not possible.
What you are doing should be fine. If you want to optimize it, cache the input data if you can, so that each filter reads the cached partitions instead of recomputing the input.
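To make the caching suggestion concrete, here is a minimal sketch, assuming the input fits the chosen storage level. The name splitByLabel and the Map return type are my own choices for illustration, not something from the question. It still runs one filter per class, so it does not reduce the number of jobs; it only avoids recomputing the input for each of them.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: returns one filtered RDD per label, keyed by that label.
def splitByLabel(input: RDD[LabeledPoint],
                 classes: List[Double]): Map[Double, RDD[LabeledPoint]] = {
  // Mark the input for persistence once. The first action materializes it,
  // and later per-class filters read the stored partitions instead of
  // recomputing the lineage of `input`.
  // Assumption: MEMORY_AND_DISK is an acceptable storage level for this data.
  input.persist(StorageLevel.MEMORY_AND_DISK)

  // filter is lazy, so nothing runs here; each resulting RDD triggers
  // its own job only when an action is called on it.
  classes.map(lbl => lbl -> input.filter(_.label == lbl)).toMap
}
```

Once all the per-class RDDs have been consumed, calling input.unpersist() releases the cached partitions.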