Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How spark works when a join is followed by a coalesce

Given I have 2 DataFrames df1 and df2

I perform a join followed by a coalesce

df1.join(df2, Seq("id")).coalesce(1)

It seems that Spark create 2 stages, and the second stage, where the SortMergeJoin happens, is computed only by one task.

So this unique task need to have both entire dataframes in memory (cf : http://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks).

Can you confirm ?

(I'd have expected that the sort use the spark.sql.shuffle.partitions settings and a third additional stage perform the coalesce).

cf DAG

enter image description here

like image 832
Yann Moisan Avatar asked Jan 31 '26 21:01

Yann Moisan


1 Answers

I've found the confirmation in the book High Performance Spark.

Since tasks are executed on the child partition, the number of tasks executed in a stage that includes a coalesce operation is equivalent to the number of partitions in the result RDD of the coalesce transformation.

like image 135
Yann Moisan Avatar answered Feb 02 '26 17:02

Yann Moisan



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!