Spark treating null values in CSV column as null datatype

My Spark application reads a CSV file, transforms it to a different format with SQL, and writes the resulting DataFrame to a different CSV file.

For example, I have an input CSV like the following:

Id|FirstName|LastName|LocationId
1|John|Doe|123
2|Alex|Doe|234

My transformation is:

Select Id, 
       FirstName, 
       LastName, 
       LocationId as PrimaryLocationId,
       null as SecondaryLocationId
from Input

(I can't say why null is used for SecondaryLocationId; it is a business requirement.) Spark can't infer a data type for SecondaryLocationId, so the column shows up as null in the schema, and the write fails with the error CSV data source does not support null data type.

Below are the printSchema() output and the write options I am using.

root
 |-- Id: string (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- PrimaryLocationId: string (nullable = false)
 |-- SecondaryLocationId: null (nullable = true)

dataFrame.repartition(1).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("delimiter", "|")
      .option("nullValue", "")
      .option("inferSchema", "true")
      .csv(outputPath)

Is there a way to default to a data type (such as string)? By the way, I can get this to work by replacing null with an empty string (''), but that is not what I want to do.
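The version that does work, but which I want to avoid because it conflates null with an empty string, looks like this:

Select Id, 
       FirstName, 
       LastName, 
       LocationId as PrimaryLocationId,
       '' as SecondaryLocationId
from Input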

asked Sep 27 '17 by tturner

1 Answer

Use lit(null):

Example:

import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._ // toDF needs the session implicits (auto-imported in spark-shell)

case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF

val dfWithFoobar = df.withColumn("foobar", lit(null: String))


scala> dfWithFoobar.printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: null (nullable = true)

This null type is not supported by the CSV writer, so the write fails. If keeping the column is a hard requirement, you can cast it to a specific type (say, String):

import org.apache.spark.sql.types.StringType
df.withColumn("foobar", lit(null).cast(StringType))

or use a UDF like this:

val getNull = udf(() => None: Option[String]) // Or some other type

df.withColumn("foobar", getNull()).printSchema

root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)

(The above reposts zero323's code.)
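The same fix can be applied directly in the question's SQL by casting the null literal, so Spark infers StringType instead of NullType. A minimal sketch, assuming the input is registered as a temporary view named Input:

val result = spark.sql("""
  Select Id,
         FirstName,
         LastName,
         LocationId as PrimaryLocationId,
         cast(null as string) as SecondaryLocationId
  from Input
""")

result.printSchema // SecondaryLocationId: string (nullable = true)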

Now let's discuss your second question.

Question:

"This only helps when I know in advance which columns will be treated as null datatype. When a large number of files are being read and various transformations applied, I wouldn't know which fields end up treated as null. Is there a way to find out?"

Answer:

In this case you can inspect the DataFrame's schema to find the NullType columns and cast them before writing, or guard individual transformations with explicit null checks (isNull/isNotNull inside when/otherwise), as in the example further below.

(As a side note on null in Scala: the Databricks Scala style guide does not agree that null should always be banned from Scala code and says: "For performance sensitive code, prefer null over Option, in order to avoid virtual method calls and boxing.")
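A minimal sketch of the schema inspection, assuming dataFrame is the DataFrame about to be written:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{NullType, StringType}

// Find every column whose type was inferred as NullType and cast it
// to StringType so the CSV writer accepts it.
val nullTypedColumns = dataFrame.schema.fields
  .filter(_.dataType == NullType)
  .map(_.name)

val writableDf = nullTypedColumns.foldLeft(dataFrame) { (df, name) =>
  df.withColumn(name, col(name).cast(StringType))
}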

Example (using when/otherwise with isNotNull so nulls pass through untouched). Given a sourceDf with a single nullable number column:

+------+
|number|
+------+
|     1|
|     8|
|    12|
|  null|
+------+


import org.apache.spark.sql.functions.{col, lit, udf, when}

// UDF used in the example: true when the number is even.
val isEvenSimpleUdf = udf((n: Int) => n % 2 == 0)

val actualDf = sourceDf.withColumn(
  "is_even",
  when(
    col("number").isNotNull,
    isEvenSimpleUdf(col("number"))
  ).otherwise(lit(null))
)

actualDf.show()
+------+-------+
|number|is_even|
+------+-------+
|     1|  false|
|     8|   true|
|    12|   true|
|  null|   null|
+------+-------+
  • https://medium.com/@mrpowers/dealing-with-null-in-spark-cfdbb12f231e
  • https://github.com/vaquarkhan/scala-style-guide
answered Nov 30 '22 by vaquar khan