
Replacing empty string with null leads to INCREASE in dataframe size?

I'm having trouble understanding the following phenomenon: in Spark 2.2, using Scala, I see a significant increase in the persisted DataFrame size after replacing literal empty string values with lit(null).

This is the function I use to replace empty string values:

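import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, lit, when}

// Replace zero-length strings with null in every column of df.
def nullifyEmptyStrings(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (acc, c) =>
    acc.withColumn(c, when(length(col(c)) === 0, lit(null: String)).otherwise(col(c)))
  }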

I observe that the persisted (DISK_ONLY) size of my initial dataframe before running this function is 1480MB, and afterwards is 1610MB. The number of partitions remains unchanged.

Any thoughts? The nulling works fine by the way, but my main reason for introducing this was to reduce shuffle size, and it seems I only increase it this way.

Chondrops asked Dec 14 '25 23:12


1 Answer

I'm going to answer this myself, as we have now done some investigation that might be useful to share.

Testing on large DataFrames (tens of millions of rows) made up entirely of String columns, we observe that replacing empty strings with nulls results in a slight decrease (1.1-1.5%) in the overall disk footprint when serialized to Parquet on S3.

However, DataFrames cached with MEMORY_ONLY or DISK_ONLY were 6% and 8% larger, respectively. I can only speculate about how Spark internally represents a null value when the column is of StringType, but whatever it is, it's bigger than an empty string. If there's any way to inspect this, I'll be glad to hear it.
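For anyone who wants to reproduce the comparison on a small scale, here is a minimal sketch of one way to measure the cached footprint. It assumes a local SparkSession and synthetic data (half of the values empty), and the helper name cachedBytes is hypothetical. It reads the sizes from SparkContext.getRDDStorageInfo, a developer API, so it reports how many bytes the cache holds and says nothing about how nulls are encoded internally. The numbers it prints depend on the data and the Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length, lit, when}
import org.apache.spark.storage.StorageLevel

// Assumption: a throwaway local session, not the cluster setup from the question.
val spark = SparkSession.builder().master("local[*]").appName("null-vs-empty-cache").getOrCreate()

// Synthetic single-column data: every second value is an empty string.
val withEmpty = spark.range(0, 200000).select(
  when(col("id") % 2 === 0, lit("")).otherwise(col("id").cast("string")).as("s"))

// Same data with the empty strings replaced by null.
val withNull = withEmpty.withColumn(
  "s", when(length(col("s")) === 0, lit(null: String)).otherwise(col("s")))

// Hypothetical helper: persist df, materialize it, and sum the memory and disk
// bytes of the cache entries that appeared as a result.
def cachedBytes(df: DataFrame, level: StorageLevel): Long = {
  val sc = df.sparkSession.sparkContext
  val alreadyCached = sc.getRDDStorageInfo.map(_.id).toSet
  df.persist(level)
  df.count() // force the cache to be populated
  val bytes = sc.getRDDStorageInfo
    .filterNot(info => alreadyCached(info.id))
    .map(info => info.memSize + info.diskSize)
    .sum
  df.unpersist(blocking = true)
  bytes
}

for (level <- Seq(StorageLevel.MEMORY_ONLY, StorageLevel.DISK_ONLY)) {
  println(s"$level empty strings: ${cachedBytes(withEmpty, level)} bytes")
  println(s"$level nulls:         ${cachedBytes(withNull, level)} bytes")
}
```

The same figures show up in the Storage tab of the Spark UI while the DataFrame is persisted.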

The phenomenon is identical in PySpark and Scala.

Our goal in using nulls was to reduce shuffle size in a complex join. Overall, we experienced the opposite. However, we'll keep using nulls because the automatic pushdown of isNotNull filters makes writing joins much cleaner in Spark SQL.
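To illustrate the join behaviour with a toy example (a sketch with made-up data; the column names k, l, r and the helper nullifyKey are hypothetical): an empty-string key matches another empty-string key in an equi-join, while a null key never matches, so the placeholder rows drop out of an inner join without an explicit filter.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length, lit, when}

// Assumption: a throwaway local session for the demonstration.
val spark = SparkSession.builder().master("local[*]").appName("null-join-keys").getOrCreate()
import spark.implicits._

val left  = Seq(("a", 1), ("", 2)).toDF("k", "l")
val right = Seq(("a", 10), ("", 20)).toDF("k", "r")

// With empty strings, "" joins to "", so both rows survive the inner join.
val emptyKeyMatches = left.join(right, Seq("k")).count()

// Hypothetical helper: turn empty join keys into nulls.
def nullifyKey(df: DataFrame): DataFrame =
  df.withColumn("k", when(length(col("k")) === 0, lit(null: String)).otherwise(col("k")))

// With nulls, only the "a" row matches, because null = null is not true in SQL.
val nullKeyMatches = nullifyKey(left).join(nullifyKey(right), Seq("k")).count()

println(s"matches with empty keys: $emptyKeyMatches, with null keys: $nullKeyMatches")

// Print the plans to see which filters the optimizer applied to the join inputs.
nullifyKey(left).join(nullifyKey(right), Seq("k")).explain(true)
```

With empty strings, the same result would need an extra condition on each side to exclude the placeholder keys by hand.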

Chondrops answered Dec 16 '25 22:12


