Hello Stack Overflow community.
I'd like your help in understanding whether my reasoning is correct or whether I'm missing something in my Spark job.
I currently have two RDDs that I want to subtract. Both are built through different transformations on the same parent RDD (fatherRdd below).
First of all, the parent RDD is cached as soon as it is obtained:
val fatherRdd = grandFather.repartition(n).mapPartitions(mapping).cache()
Then the two RDDs are derived from it. One is (pseudocode):
val son1 = fatherRdd.filter(filtering_logic).map(take_only_key).distinct()
The other one is:
val son2 = fatherRdd.filter(filtering_logic2).map(take_only_key).distinct()
The two sons are then subtracted to obtain only the keys that appear in son1 but not in son2:
son1.subtract(son2)
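(For reference, subtract keeps the elements of the first RDD that are not present in the second. A minimal illustration, assuming a SparkContext named sc:

val a = sc.parallelize(Seq(1, 2, 3, 4))
val b = sc.parallelize(Seq(3, 4, 5))
a.subtract(b).collect()  // Array(1, 2), element order not guaranteed
)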
I would expect the sequence of operations to be the following: first repartition and mapPartitions on the parent, computed once and cached; then, starting from the cached data, filter, map, and distinct on both RDDs; and finally the subtract.
This is not what happens: I see the two distinct jobs running in parallel, apparently not exploiting the benefits of caching (there are no skipped tasks), and each taking almost the same computation time.
Below is the image of the DAG taken from the Spark UI.
Do you have any suggestions for me?
You are correct in your observations. Transformations on RDDs are lazy, so calling cache only marks the RDD for storage; the cache is actually populated the first time the RDD is computed.
If you call an action on your parent RDD right after caching it, it will be computed and materialized in the cache, and your subsequent operations will then operate on the cached data instead of recomputing the whole lineage.
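A minimal sketch of that fix, reusing the placeholder names from your snippet (grandFather, mapping, filtering_logic, filtering_logic2, take_only_key):

val fatherRdd = grandFather.repartition(n).mapPartitions(mapping).cache()

// Force one full computation so the partitions are materialized in the cache.
fatherRdd.count()

// Both branches now read the parent's partitions from the cache
// instead of recomputing repartition + mapPartitions twice.
val son1 = fatherRdd.filter(filtering_logic).map(take_only_key).distinct()
val son2 = fatherRdd.filter(filtering_logic2).map(take_only_key).distinct()

val result = son1.subtract(son2)

After the count(), the Spark UI DAG should show the parent RDD as cached (marked with a green dot), and the two distinct jobs should no longer recompute it.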