Pyspark

Question

I am creating an empty DataFrame, and when I call withColumn on it I get the new columns but no data, as shown below:

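from pyspark.sql import functions as F
from pyspark.sql.types import StructType

# sc and sqlContext are assumed to be already defined, as in the pyspark shell
schema = StructType([])
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
json = list(map(lambda row: row.asDict(True), df.collect()))
df.show()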

++
||
++
++

df = df.withColumn('First_name', F.lit('Tony'))\
       .withColumn('Last_name', F.lit('Chapman'))\
       .withColumn('Age', F.lit('28'))
df.show()

+----------+---------+---+
|First_name|Last_name|Age|
+----------+---------+---+
+----------+---------+---+

What is the reason for this? How can I solve it?

Arnon Rotem-Gal-Oz · Accepted Answer

That's the expected result. withColumn adds a column to every existing row: Spark evaluates the column expression once per row. Since your DataFrame is empty, there are no rows to evaluate it on, so the new columns appear in the schema but hold no values.

If you want to put some data into a DataFrame, you need to create it from that data, for example with parallelize:

from pyspark.sql import Row
l = [('Tony','Chapman',28)]
rdd = sc.parallelize(l)
rdd_rows = rdd.map(lambda x: Row(First_Name=x[0], Last_Name=x[1], Age=int(x[2])))
df = sqlContext.createDataFrame(rdd_rows)

Or, from Spark 2.0 (thanks pault), you can skip the RDD creation:

l = [('Tony','Chapman',28)]
df = sqlContext.createDataFrame(l, ["First_Name", "Last_Name", "Age"])

Pyspark - withColumn is not working while calling on empty dataframe

Tags:

python
