 

How is ColumnarToRow an efficient operation in Spark

In my understanding, a columnar format is better for MapReduce-style tasks. Even for something like selecting a few columns, columnar works well because we don't have to load the other columns into memory.
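
For example, here is a rough sketch of what I mean (the /tmp/cols path is just a scratch location I made up): with a columnar source like Parquet, selecting one column means the other column never has to be read.

from pyspark.sql import functions as F

# Write a small two-column dataset to Parquet, then read back only 'a'.
spark.range(2000).select(
    F.col("id").alias("a"),
    (1999 - F.col("id")).alias("b"),
).write.mode("overwrite").parquet("/tmp/cols")

spark.read.parquet("/tmp/cols").select("a").filter("a > 500").explain()
# The FileScan parquet node should report ReadSchema: struct<a:bigint>,
# i.e. column 'b' is never loaded from disk.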

But in Spark 3.0 I'm seeing this ColumnarToRow operation being applied in the query plans, which, from what I could understand from the docs, converts the data into row format.

How is it more efficient than the columnar representation, and what are the insights that govern the application of this rule?

For the following code, I've attached the query plan.

import pandas as pd

# 'spark' below is the SparkSession available in the PySpark shell.
df = pd.DataFrame({
    'a': [i for i in range(2000)],
    'b': [i for i in reversed(range(2000))],
})

df = spark.createDataFrame(df)

df.cache()
df.select('a').filter('a > 500').show()

[query plan image]

asked Jan 19 '26 by kar09

1 Answer

The ColumnarToRow part in your WSCG (whole-stage code generation) stage is actually the conversion of the pandas dataframe into a Spark DataFrame, rather than any indication of how Spark processes its own dataframes.

If we start with a "native" Spark df, the plan looks much different:

>>> a = range(2000)
>>> b = [ i for i in reversed(range(2000))]
>>> df = spark.createDataFrame(zip(a,b),["a","b"])
>>> df.select('a').filter('a > 500').show()

[query plan for the native DataFrame]
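
A quick way to compare the two plans without relying on screenshots (assuming Spark 3.0+, where explain() accepts a mode argument) is the formatted explain output:

>>> # Text version of the plan; the pandas-backed DataFrame shows a
>>> # ColumnarToRow node feeding WholeStageCodegen, the native one does not.
>>> df.select('a').filter('a > 500').explain(mode="formatted")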

Besides, the link you were referring to says:

case class ColumnarToRowExec(child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable

Provides a common executor to translate an RDD of ColumnarBatch into an RDD of InternalRow. This is inserted whenever such a transition is determined to be needed.

...which basically means converting an RDD of columnar batches built from an external source (pandas in your case) into Spark's internal representation (an RDD of InternalRow).
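
If you want to check for that transition programmatically rather than reading the plan by eye, here is a rough sketch that reaches into PySpark's JVM internals via _jdf (a private, unofficial interface, so treat it as illustrative only):

>>> plan = df._jdf.queryExecution().executedPlan().toString()
>>> "ColumnarToRow" in plan   # True when the batch-to-row transition is present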

answered Jan 21 '26 by mazaneicha


