Spark treating null values in CSV column as null datatype

My Spark application reads a CSV file, transforms it to a different format with SQL, and writes the resulting DataFrame to a different CSV file.

For example, I have an input CSV like the following:

Id|FirstName|LastName|LocationId
1|John|Doe|123
2|Alex|Doe|234

My transformation is:

Select Id, 
       FirstName, 
       LastName, 
       LocationId as PrimaryLocationId,
       null as SecondaryLocationId
from Input

(I can't say why null is used for SecondaryLocationId; it is a business requirement.) Spark can't infer a data type for SecondaryLocationId, so the column shows up as null in the schema, and the write fails with the error CSV data source does not support null data type.

Below are the printSchema() output and the write options I am using.

root
 |-- Id: string (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- PrimaryLocationId: string (nullable = false)
 |-- SecondaryLocationId: null (nullable = true)

dataFrame.repartition(1).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("delimiter", "|")
      .option("nullValue", "")
      .option("inferSchema", "true")
      .csv(outputPath)

Is there a way to default to a data type (such as string)? By the way, I can get this to work by replacing null with an empty string (''), but that is not what I want to do.
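The version that does work, but which I want to avoid because it conflates null with an empty string, looks like this:

Select Id, 
       FirstName, 
       LastName, 
       LocationId as PrimaryLocationId,
       '' as SecondaryLocationId
from Input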

asked Sep 27 '17 by tturner

1 Answer

Use lit(null):

Example:

import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._ // toDF needs the session implicits (auto-imported in spark-shell)

case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF

val dfWithFoobar = df.withColumn("foobar", lit(null: String))


scala> dfWithFoobar.printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: null (nullable = true)

This null type is not supported by the CSV writer, so the write fails. If keeping the column is a hard requirement, you can cast it to a specific type (say, String):

import org.apache.spark.sql.types.StringType
df.withColumn("foobar", lit(null).cast(StringType))

or use a UDF like this:

val getNull = udf(() => None: Option[String]) // Or some other type

df.withColumn("foobar", getNull()).printSchema

root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)

(The above reposts zero323's code.)
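The same fix can be applied directly in the question's SQL by casting the null literal, so Spark infers StringType instead of NullType. A minimal sketch, assuming the input is registered as a temporary view named Input:

val result = spark.sql("""
  Select Id,
         FirstName,
         LastName,
         LocationId as PrimaryLocationId,
         cast(null as string) as SecondaryLocationId
  from Input
""")

result.printSchema // SecondaryLocationId: string (nullable = true)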

Now let's discuss your second question.

Question:

"This only helps when I know in advance which columns will be treated as null datatype. When a large number of files are being read and various transformations applied, I wouldn't know which fields end up treated as null. Is there a way to find out?"

Answer:

In this case you can inspect the DataFrame's schema to find the NullType columns and cast them before writing, or guard individual transformations with explicit null checks (isNull/isNotNull inside when/otherwise), as in the example further below.

(As a side note on null in Scala: the Databricks Scala style guide does not agree that null should always be banned from Scala code and says: "For performance sensitive code, prefer null over Option, in order to avoid virtual method calls and boxing.")
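A minimal sketch of the schema inspection, assuming dataFrame is the DataFrame about to be written:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{NullType, StringType}

// Find every column whose type was inferred as NullType and cast it
// to StringType so the CSV writer accepts it.
val nullTypedColumns = dataFrame.schema.fields
  .filter(_.dataType == NullType)
  .map(_.name)

val writableDf = nullTypedColumns.foldLeft(dataFrame) { (df, name) =>
  df.withColumn(name, col(name).cast(StringType))
}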

Example (using when/otherwise with isNotNull so nulls pass through untouched). Given a sourceDf with a single nullable number column:

+------+
|number|
+------+
|     1|
|     8|
|    12|
|  null|
+------+


import org.apache.spark.sql.functions.{col, lit, udf, when}

// UDF used in the example: true when the number is even.
val isEvenSimpleUdf = udf((n: Int) => n % 2 == 0)

val actualDf = sourceDf.withColumn(
  "is_even",
  when(
    col("number").isNotNull,
    isEvenSimpleUdf(col("number"))
  ).otherwise(lit(null))
)

actualDf.show()
+------+-------+
|number|is_even|
+------+-------+
|     1|  false|
|     8|   true|
|    12|   true|
|  null|   null|
+------+-------+
  • https://medium.com/@mrpowers/dealing-with-null-in-spark-cfdbb12f231e
  • https://github.com/vaquarkhan/scala-style-guide
answered Nov 30 '22 by vaquar khan