Does Spark do one or multiple passes through data when multiple withColumn functions are chained?
For example:
val dfnew = df.withColumn("newCol1", f1(col("a")))
.withColumn("newCol2", f2(col("b")))
.withColumn("newCol3", f3(col("c")))
where
- df is my input DataFrame containing at least columns a, b, c
- dfnew is the output DataFrame with three new columns newCol1, newCol2, newCol3
- f1, f2, f3 are some user-defined functions or some Spark operations on columns, like cast, etc.
In my project I can have even 30 independent withColumn functions chained with foldLeft.
Important
I am assuming here that f2 does not depend on the result of f1, and f3 does not depend on the results of f1 and f2. The functions could be performed in any order. There is no shuffle in any of the functions.
My observations
- withColumn does not increase execution time in a way that would suggest additional passes through the data.
- I compared a single SQLTransformer with a select statement containing all the functions against multiple separate SQLTransformers, one for each function; the execution times were similar.
Questions
1. Will Spark make one or three passes through the data, one for each withColumn?
2. Does it depend on the type of the functions f1, f2, f3? UDFs vs. generic Spark operations?
3. If the functions f1, f2, f3 are inside the same stage, does it mean they are in the same data pass?
4. Does the number of passes depend on shuffles within the functions? What if there is no shuffle?
5. If I chain the withColumn functions with foldLeft, will it change the number of passes?
6. I could do something similar with three SQLTransformers or just one SQLTransformer with all three transformations in the same select_statement. How many passes through the data would that do?
7. Does it basically not matter, since the execution time will be similar for 1 and 3 passes?
Will Spark make one or three passes through the data, one for each withColumn?
Spark will "make one passage" through the data. Why? Because spark doesn't actually do anything when this code is reached, it just builds an execution plan which would tell it what to do when dfnew is used (i.e. some action, e.g. count, collect, write etc.) is executed on it. Then, it would be able to compute all functions at once for each record.
Does it depend on the type of functions f1, f2, f3? UDF vs generic Spark operations?
No.
If the functions f1, f2, f3 are inside the same stage, does it mean they are in the same data pass?
Yes.
Does the number of passes depend on shuffles within the functions? What if there is no shuffle?
Almost. First, as long as no caching / checkpointing is used, the number of passes over the data equals the number of actions executed on the resulting dfnew DataFrame. Second, each shuffle means each record is read, potentially sent between worker nodes, potentially written to disk, and then read again.
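To illustrate the first point, a hedged sketch (the output paths are illustrative):

// Without caching, each action triggers its own pass over the source data:
dfnew.count()                    // first pass
dfnew.write.parquet("/tmp/out1") // second pass: the plan is recomputed

// With caching, the first action materializes dfnew and later actions
// read from the cache instead of recomputing from the source:
dfnew.cache()
dfnew.count()                    // computes once and populates the cache
dfnew.write.parquet("/tmp/out2") // served from the cache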
If I chain the withColumn functions with foldLeft, will it change the number of passes?
No. It only changes how the plan is constructed in code; the resulting plan is exactly the same, so the computation remains the same.
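For reference, a minimal sketch of the foldLeft pattern (the transforms sequence is illustrative; f1, f2, f3 are the functions from the question):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Pairs of (new column name, expression that produces it).
val transforms: Seq[(String, Column)] = Seq(
  "newCol1" -> f1(col("a")),
  "newCol2" -> f2(col("b")),
  "newCol3" -> f3(col("c"))
)

// Equivalent to chaining withColumn by hand: the resulting plan is identical.
val dfnew: DataFrame = transforms.foldLeft(df) {
  case (acc, (name, expr)) => acc.withColumn(name, expr)
}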
I could do something similar with three SQLTransformers or just one SQLTransformer with all three transformations in the same select_statement. How many passes through the data would that do?
Again, this won't make any difference, as the execution plan will remain the same.
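A minimal sketch, assuming the three functions can be written as SQL expressions (the concrete expressions are illustrative stand-ins):

import org.apache.spark.ml.feature.SQLTransformer

// One SQLTransformer carrying all three transformations; __THIS__ is the
// placeholder SQLTransformer substitutes with the input DataFrame.
val combined = new SQLTransformer().setStatement(
  "SELECT *, CAST(a AS DOUBLE) AS newCol1, UPPER(b) AS newCol2, " +
  "LENGTH(c) AS newCol3 FROM __THIS__")

// Whether you apply this once or chain three single-expression transformers,
// Catalyst collapses the adjacent projections into the same physical plan.
combined.transform(df).explain()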
Basically it doesn't matter; the time of execution will be similar for 1 and 3 passes?
Not sure what this means, but it sounds incorrect: execution time is mostly a function of the number of shuffles and the number of actions (assuming the same data and the same cluster setup).