In my understanding, a columnar format is better suited to MapReduce-style tasks. Even for something like selecting a few columns, columnar works well because we don't have to load the other columns into memory.
But in Spark 3.0 I'm seeing this ColumnarToRow operation being applied in the query plans, which, from what I could understand from the docs, converts the data into row format.
How is that more efficient than keeping the columnar representation, and what insights govern the application of this rule?
For the following code, I've attached the query plan.
import pandas as pd

df = pd.DataFrame({
    'a': [i for i in range(2000)],
    'b': [i for i in reversed(range(2000))],
})

# convert the pandas DataFrame to a Spark DataFrame and cache it
df = spark.createDataFrame(df)
df.cache()
df.select('a').filter('a > 500').show()
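(The plan itself was attached as an image; if you want it as text, explain() prints it, and Spark 3.0 adds a "formatted" mode that lists one block per operator:)

df.select('a').filter('a > 500').explain(mode='formatted')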

The ColumnarToRow part in your WSCG (WholeStageCodegen) is actually the conversion of the pandas DataFrame into a Spark DataFrame, rather than an indication of how Spark processes its own DataFrames.
If we start with a "native" Spark df, the plan looks much different:
>>> a = range(2000)
>>> b = [i for i in reversed(range(2000))]
>>> df = spark.createDataFrame(zip(a, b), ["a", "b"])
>>> df.select('a').filter('a > 500').show()
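
For comparison, you can print this plan in the same session; with a DataFrame built this way I would expect a plain scan of the parallelized data and no ColumnarToRow node, though the exact operators depend on your Spark version:

>>> df.select('a').filter('a > 500').explain()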

Besides, the link you were referring to says:
case class ColumnarToRowExec(child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable

Provides a common executor to translate an RDD of ColumnarBatch into an RDD of InternalRow. This is inserted whenever such a transition is determined to be needed.
...which basically means a conversion of an external RDD (the pandas one in your case) into Spark's internal representation (an RDD of InternalRows).
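
If you want to check how much of this is down to the pandas-to-Spark conversion path, one rough experiment (my suggestion, not something from the linked docs) is to build the same cached DataFrame with the Arrow-based conversion turned on and off and compare the two plans. spark.sql.execution.arrow.pyspark.enabled is the Spark 3.0 name of that setting; whether the ColumnarToRow node actually changes is exactly what the experiment would tell you:

>>> import pandas as pd
>>> pdf = pd.DataFrame({'a': list(range(2000)), 'b': list(reversed(range(2000)))})
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")   # pandas -> Spark conversion via Arrow
>>> spark.createDataFrame(pdf).cache().select('a').filter('a > 500').explain()
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")  # plain (non-Arrow) conversion
>>> spark.createDataFrame(pdf).cache().select('a').filter('a > 500').explain()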