
PySpark salting: replace nulls in a column with random negative values

I have many columns that I'm performing joins on, which can sometimes contain billions of rows with null values, so I would like to salt the columns to prevent skew after the join, as described in Jason Evan's post: https://stackoverflow.com/a/43394695

I cannot find an equivalent example of this in Python, and the syntax is just different enough that I can't figure out how to translate it.

I approximately have this:

import pyspark.sql.functions as psf
big_neg = -200
for column in key_fields: #key_fields is a list of join keys in the dataframe
    df = df.withColumn(column,
                       psf.when(psf.col(column).isNull(),
                                psf.round(psf.rand().multiply(big_neg))
                      ).otherwise(df[column]))

This currently fails with:

TypeError: 'Column' object is not callable

but I have already tried many syntax variations to get rid of the TypeError and am stumped.

asked Dec 21 '25 by skewed_to_death_94


1 Answer

I was actually able to figure it out after taking a break.

I thought it would be helpful for anyone else who encounters this problem, so I will post my solution. The problem was the .multiply() call: PySpark's Column class doesn't expose a multiply method the way the Scala API does, which is what triggers the TypeError; multiplying with the * operator works instead:

df = df.withColumn(column,
                   psf.when(df[column].isNull(),
                            psf.round(psf.rand() * big_neg))
                      .otherwise(df[column]))
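
For completeness, here is roughly how that line slots back into the loop from the question, together with the join itself. This is just a sketch: other_df and the left join are hypothetical placeholders, and it assumes the real key values are never negative, so the salted rows still match nothing in the other table, exactly as genuine nulls would.

import pyspark.sql.functions as psf

big_neg = -200

# Give each null key a distinct random negative value so those rows are spread
# across partitions instead of all hashing to the same one.
for column in key_fields:  # key_fields is the list of join keys, as above
    df = df.withColumn(column,
                       psf.when(df[column].isNull(),
                                psf.round(psf.rand() * big_neg))
                          .otherwise(df[column]))

# other_df is a hypothetical second dataframe sharing the same key columns.
joined = df.join(other_df, on=key_fields, how="left")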
answered Dec 23 '25 by skewed_to_death_94