
Why does Spark SQL translate the String "null" to the object null for Float/Double types?

I have a dataframe containing float and double values.

scala> val df = List((Float.NaN, Double.NaN), (1f, 0d)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: float, y: double]

scala> df.show
+---+---+
|  x|  y|
+---+---+
|NaN|NaN|
|1.0|0.0|
+---+---+

scala> df.printSchema
root
 |-- x: float (nullable = false)
 |-- y: double (nullable = false)

To replace the NaN values with nulls, I passed "null" as a String in the Map supplied to the fill operation.

scala> val map = df.columns.map((_, "null")).toMap
map: scala.collection.immutable.Map[String,String] = Map(x -> null, y -> null)

scala> df.na.fill(map).printSchema
root
 |-- x: float (nullable = true)
 |-- y: double (nullable = true)


scala> df.na.fill(map).show
+----+----+
|   x|   y|
+----+----+
|null|null|
| 1.0| 0.0|
+----+----+

I got the expected result, but I was not able to understand how or why Spark SQL translates the String "null" into a null object.

asked Sep 08 '25 by himanshuIIITian


2 Answers

If you look into the fill function in Dataset, it checks the datatype of each value and tries to convert it to the datatype of the column's schema. If the value can be converted, it is used; otherwise null is returned.

So fill does not convert the String "null" to the object null; it returns null because an exception occurs while converting.
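A minimal sketch of this behaviour in plain Scala (coerceToDouble is a hypothetical helper approximating what fill does per numeric column, not Spark's actual code):

```scala
import scala.util.Try

// Approximation of fill's coercion for a double column:
// try the conversion, fall back to null on failure.
def coerceToDouble(value: String): java.lang.Double =
  Try(java.lang.Double.valueOf(value)).getOrElse(null)

coerceToDouble("null")    // null    -- "null" is not a parseable number
coerceToDouble("9999.99") // 9999.99 -- parseable, so the value is kept
```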

val map = df.columns.map((_, "WHATEVER")).toMap

gives null, while

val map = df.columns.map((_, "9999.99")).toMap

gives 9999.99.

If you replace the NaN with a value of the same datatype, you get the result as expected.
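For instance, a sketch against the question's DataFrame: na.fill also accepts numeric replacement values directly, and for a Double it replaces null/NaN in all numeric columns.

```scala
// Replace NaN with a Double instead of a String:
df.na.fill(0.0).show()
// +---+---+
// |  x|  y|
// +---+---+
// |0.0|0.0|
// |1.0|0.0|
// +---+---+
```

A per-column Map with typed values works too, e.g. df.na.fill(Map("x" -> 0f, "y" -> 0d)).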

Hope this helps you to understand!

answered Sep 10 '25 by koiralo


It's not that "null" as a String translates to a null object. You can try using the transformation with any String and still get null (with the exception of strings that can directly be cast to double/float, see below). For example, using

val map = df.columns.map((_, "abc")).toMap

would give the same result. My guess is that, since the columns are of type float and double, casting an arbitrary String to those types fails and therefore gives null. Using a number instead would work as expected, e.g.

val map = df.columns.map((_, 1)).toMap

As some strings can directly be cast to double or float those are also possible to use in this case.

val map = df.columns.map((_, "1")).toMap
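The underlying cast semantics can be seen directly in spark-shell (a sketch; Spark SQL's CAST returns null on failure rather than throwing):

```scala
spark.sql("SELECT CAST('abc' AS double) AS bad, CAST('1' AS double) AS good").show()
// +----+----+
// | bad|good|
// +----+----+
// |null| 1.0|
// +----+----+
```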
answered Sep 10 '25 by Shaido