 

Parallelise a custom function with PySpark

Tags: python, pyspark

I'm familiar with using UDFs to apply a custom function row-by-row to a DataFrame. However, I would like to know how to apply a custom function to different subsets of my DataFrame in parallel.

Here's a simplified example:

import numpy as np
import pandas as pd

# Dummy data: 100 rows with a group label 'id' and a numeric value 'val'
dummy_data = pd.DataFrame({'id': np.random.choice(['a', 'b', 'c'], size=100),
                           'val': np.random.normal(size=100)})

My custom function takes an array of numbers as an input. For each unique 'id', I want to apply my function to the array of 'val' values associated with that id.

The simplistic way I'm doing it right now is to loop over my PySpark DataFrame and, for each 'id', convert the data to a pandas DataFrame and apply the function. It works, but it's obviously slow and makes no use of Spark.
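For context, the slow approach looks roughly like the sketch below; `my_custom_function` is just a hypothetical stand-in for my real function, and `spark` is an existing SparkSession:

import pandas as pd

# Hypothetical stand-in for the real custom function: takes an array of
# numbers and returns a single value.
def my_custom_function(vals):
    return float(pd.Series(vals).median())

# Assumes an active SparkSession named `spark`.
spark_df = spark.createDataFrame(dummy_data)

# Naive, serial approach: pull each group back to the driver one at a time.
results = {}
for row in spark_df.select('id').distinct().collect():
    group_pdf = spark_df.filter(spark_df.id == row.id).toPandas()
    results[row.id] = my_custom_function(group_pdf['val'].values)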

How can I parallelise this?

Asked by Dan

1 Answer

This answer is short enough that it should really be a comment, but I don't have enough reputation to comment.

Spark 2.3 introduced pandas vectorized UDFs, which are exactly what you're looking for: they run a custom pandas transformation over a grouped Spark DataFrame in a distributed fashion, with great performance thanks to PyArrow serialization.

See

  • https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
  • http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?pyspark.sql.functions.pandas_udf#pyspark.sql.functions.pandas_udf
  • Using Collect_set after exploding in a groupedBy object in Pyspark

for more information and examples.
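As a hedged sketch (assuming Spark 2.3+ with PyArrow installed, and reusing the Spark DataFrame `spark_df` and the placeholder per-group function `my_custom_function` from the question), a grouped-map pandas UDF could look like this:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Output schema of the per-group result: one row per 'id'.
schema = StructType([
    StructField('id', StringType()),
    StructField('result', DoubleType()),
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def apply_my_function(pdf):
    # pdf is a pandas DataFrame holding all rows for a single 'id';
    # Spark runs this function once per group, in parallel across executors.
    return pd.DataFrame({'id': [pdf['id'].iloc[0]],
                         'result': [my_custom_function(pdf['val'].values)]})

result_df = spark_df.groupby('id').apply(apply_my_function)
result_df.show()

On Spark 3.x the same idea is usually written with spark_df.groupby('id').applyInPandas(func, schema), where func is a plain Python function that takes and returns a pandas DataFrame.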

Answered by Florent F

