I have a dataframe containing float and double values.
scala> val df = List((Float.NaN, Double.NaN), (1f, 0d)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: float, y: double]
scala> df.show
+---+---+
|  x|  y|
+---+---+
|NaN|NaN|
|1.0|0.0|
+---+---+
scala> df.printSchema
root
|-- x: float (nullable = false)
|-- y: double (nullable = false)
When I replaced the NaN values with null, I gave "null" as a String to the Map in the fill operation.
scala> val map = df.columns.map((_, "null")).toMap
map: scala.collection.immutable.Map[String,String] = Map(x -> null, y -> null)
scala> df.na.fill(map).printSchema
root
|-- x: float (nullable = true)
|-- y: double (nullable = true)
scala> df.na.fill(map).show
+----+----+
|   x|   y|
+----+----+
|null|null|
| 1.0| 0.0|
+----+----+
And I got the correct result. But I could not understand how/why Spark SQL translates "null" as a String into a null object?
If you look into the fill function (defined in DataFrameNaFunctions, reached via Dataset.na), it checks the datatype of each column and tries to convert the replacement value to that column's datatype. If the value can be converted, it converts it; otherwise it returns null. So it does not translate the String "null" into the object null; it returns null because the conversion fails.
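You can see that cast behaviour directly in spark-shell. A minimal sketch, assuming Spark's classic non-ANSI casting (where a String that cannot be parsed as a double casts to null rather than throwing):

import org.apache.spark.sql.functions.lit

// "null" cannot be parsed as a double, so the cast produces null;
// "9999.99" parses cleanly, so the cast produces 9999.99.
spark.range(1)
  .select(
    lit("null").cast("double").as("unparseable"),
    lit("9999.99").cast("double").as("parseable"))
  .show()
// +-----------+---------+
// |unparseable|parseable|
// +-----------+---------+
// |       null|  9999.99|
// +-----------+---------+

This matches what the answer describes: fill casts the replacement value to the column's datatype, and the failed cast is what surfaces as null.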
val map = df.columns.map((_, "WHATEVER")).toMap
gives null
and val map = df.columns.map((_, "9999.99")).toMap
gives 9999.99
If you update the NaN with a value of the same datatype, you get the result as expected.
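For instance, a small sketch passing the replacement as a Double instead of a String (df.na.fill(value: Double) applies to all numeric columns; the expected output below assumes the df from the question):

// No String-to-number cast is involved, so nothing degrades to null:
// the NaN in both numeric columns becomes 9999.99.
df.na.fill(9999.99).show()
// +-------+-------+
// |      x|      y|
// +-------+-------+
// |9999.99|9999.99|
// |    1.0|    0.0|
// +-------+-------+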
Hope this helps you to understand!
It's not that "null" as a String translates to a null object. You can try the transformation with any String and still get null (with the exception of Strings that can directly be cast to double/float, see below). For example, using
val map = df.columns.map((_, "abc")).toMap
would give the same result. My guess is that, since the columns are of type float and double, converting an arbitrary String to those types gives null. Using a number instead would work as expected, e.g.
val map = df.columns.map((_, 1)).toMap
As some Strings can directly be cast to double or float, those are also possible to use in this case.
val map = df.columns.map((_, "1")).toMap