Given two DataFrames, df1 and df2, I perform a join followed by a coalesce:
df1.join(df2, Seq("id")).coalesce(1)
It seems that Spark creates 2 stages, and the second stage, where the SortMergeJoin happens, is computed by only one task.
So this single task needs to hold both entire DataFrames in memory (cf. http://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks).
Can you confirm?
(I'd have expected the sort to use the spark.sql.shuffle.partitions setting, and a third, additional stage to perform the coalesce.)
cf. the DAG.
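
For reference, a minimal sketch that reproduces the plan (the local master, sample data and broadcast-join setting are illustration-only assumptions, not from the original question). explain() and the Spark UI show Coalesce(1) sitting directly on top of the SortMergeJoin, with no shuffle between them:

import org.apache.spark.sql.SparkSession

object CoalesceAfterJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-after-join")
      .master("local[*]")
      // Disable broadcast joins so the tiny example data still produces a SortMergeJoin.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v1")
    val df2 = Seq((2, "x"), (3, "y"), (4, "z")).toDF("id", "v2")

    val joined = df1.join(df2, Seq("id")).coalesce(1)

    // The physical plan shows Coalesce(1) directly above the SortMergeJoin,
    // i.e. no extra shuffle/stage between them: the join itself runs inside
    // the single post-coalesce task.
    joined.explain()
    joined.show()

    spark.stop()
  }
}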

I've found the confirmation in the book High Performance Spark:
"Since tasks are executed on the child partition, the number of tasks executed in a stage that includes a coalesce operation is equivalent to the number of partitions in the result RDD of the coalesce transformation."
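
A common workaround (not from the book quote above) is to use repartition(1) instead of coalesce(1): repartition introduces a shuffle boundary, so the join still runs with spark.sql.shuffle.partitions tasks and only the final, post-shuffle stage is executed by a single task. A sketch, reusing df1 and df2 from the snippet above:

// repartition(1) adds an Exchange, so the SortMergeJoin keeps its parallelism
// and only the stage after the shuffle runs as a single task.
val singlePartitionOutput = df1.join(df2, Seq("id")).repartition(1)

// The plan now shows an extra shuffle Exchange between the join and the output.
singlePartitionOutput.explain()

The trade-off is one extra shuffle of the join result, but the join itself stays parallel.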