Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove first element in RDD without using filter function

I have built an RDD from a file where each element in the RDD is section from the file separated by a delimiter.

val inputRDD1:RDD[(String,Long)] = myUtilities.paragraphFile(spark,path1)
                                              .coalesce(100*spark.defaultParallelism) 
                                              .zipWithIndex() //RDD[String, Long]
                                              .filter(f => f._2!=0)

The reason I do the last operation above (filter) is to remove the first index 0.

Is there a better way to remove the first element rather than to check each element for the index value as done above?

Thanks!

like image 464
user1384205 Avatar asked Oct 19 '25 13:10

user1384205


1 Answers

One possibility is to use RDD.mapPartitionsWithIndex and to remove the first element from the iterator at index 0:

val inputRDD = myUtilities
                .paragraphFile(spark,path1)
                .coalesce(100*spark.defaultParallelism) 
                .mapPartitionsWithIndex(
                   (index, it) => if (index == 0) it.drop(1) else it,
                    preservesPartitioning = true
                 )

This way, you only ever advance a single item on the first iterator, where all others remain untouched. Is this be more efficient? Probably. Anyway, I'd test both versions to see which one performs better.

like image 149
Yuval Itzchakov Avatar answered Oct 22 '25 04:10

Yuval Itzchakov



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!