Given two DataFrames, df1 and df2, I perform a join followed by a coalesce:
df1.join(df2, Seq("id")).coalesce(1)
It seems that Spark creates 2 stages, and the second stage, where the SortMergeJoin happens, is computed by only one task.
So this single task needs to hold both entire DataFrames in memory (cf. http://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks).
Can you confirm?
(I'd have expected the sort to use the spark.sql.shuffle.partitions setting, and a third, additional stage to perform the coalesce.)
cf. the DAG.
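
For reference, a minimal sketch that reproduces the plan (the local master, sample data and broadcast-join setting are illustration-only assumptions, not from the original question). explain() and the Spark UI show Coalesce(1) sitting directly on top of the SortMergeJoin, with no shuffle between them:

import org.apache.spark.sql.SparkSession

object CoalesceAfterJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-after-join")
      .master("local[*]")
      // Disable broadcast joins so the tiny example data still produces a SortMergeJoin.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v1")
    val df2 = Seq((2, "x"), (3, "y"), (4, "z")).toDF("id", "v2")

    val joined = df1.join(df2, Seq("id")).coalesce(1)

    // The physical plan shows Coalesce(1) directly above the SortMergeJoin,
    // i.e. no extra shuffle/stage between them: the join itself runs inside
    // the single post-coalesce task.
    joined.explain()
    joined.show()

    spark.stop()
  }
}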

I've found the confirmation in the book High Performance Spark:
"Since tasks are executed on the child partition, the number of tasks executed in a stage that includes a coalesce operation is equivalent to the number of partitions in the result RDD of the coalesce transformation."
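
A common workaround (not from the book quote above) is to use repartition(1) instead of coalesce(1): repartition introduces a shuffle boundary, so the join still runs with spark.sql.shuffle.partitions tasks and only the final, post-shuffle stage is executed by a single task. A sketch, reusing df1 and df2 from the snippet above:

// repartition(1) adds an Exchange, so the SortMergeJoin keeps its parallelism
// and only the stage after the shuffle runs as a single task.
val singlePartitionOutput = df1.join(df2, Seq("id")).repartition(1)

// The plan now shows an extra shuffle Exchange between the join and the output.
singlePartitionOutput.explain()

The trade-off is one extra shuffle of the join result, but the join itself stays parallel.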