I am creating an empty DataFrame for a requirement, and when I call the withColumn function on it, I get the columns but no data, as shown below:
schema = StructType([])
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
json = list(map(lambda row: row.asDict(True), df.collect()))
df.show()
++
||
++
++
df= df.withColumn('First_name',F.lit('Tony'))\
.withColumn('Last_name',F.lit('Chapman'))\
.withColumn('Age',F.lit('28'))
df.show()
+----------+---------+---+
|First_name|Last_name|Age|
+----------+---------+---+
+----------+---------+---+
What is the reason for this? How do I solve it?
That's the expected result. withColumn computes a value for each existing row and adds it as a new column. Since your DataFrame has no rows, there is nothing to compute values for, so the new columns exist in the schema but the DataFrame stays empty.
If you want to get actual data into a DataFrame, create it from rows, for example with parallelize:
from pyspark.sql import Row
l = [('Tony','Chapman',28)]
rdd = sc.parallelize(l)
rdd_rows = rdd.map(lambda x: Row(First_Name=x[0], Last_Name=x[1], Age=int(x[2])))
df = sqlContext.createDataFrame(rdd_rows)
Or, from Spark 2.0 onward (thanks pault), you can skip the RDD creation entirely:
l = [('Tony','Chapman',28)]
df = sqlContext.createDataFrame(l, ["First_Name", "Last_Name", "Age"])