 

Parallelise a custom function with PySpark

Tags: python, pyspark

I'm familiar with using UDFs to apply a custom function row-by-row to a DataFrame. However, I would like to know how to apply a custom function to different subsets of my DataFrame in parallel.

Here's a simplified example:

import numpy as np
import pandas as pd

# Dummy data: 100 rows with a group label 'id' and a numeric value 'val'
dummy_data = pd.DataFrame({'id': np.random.choice(['a', 'b', 'c'], size=100),
                           'val': np.random.normal(size=100)})

My custom function takes an array of numbers as an input. For each unique 'id', I want to apply my function to the array of 'val' values associated with that id.

The simplistic way I'm doing it right now is to loop over my PySpark DataFrame and, for each 'id', convert the data to a pandas DataFrame and apply the function. It works, but it's obviously slow and makes no use of Spark.
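For context, the slow approach looks roughly like the sketch below; `my_custom_function` is just a hypothetical stand-in for my real function, and `spark` is an existing SparkSession:

import pandas as pd

# Hypothetical stand-in for the real custom function: takes an array of
# numbers and returns a single value.
def my_custom_function(vals):
    return float(pd.Series(vals).median())

# Assumes an active SparkSession named `spark`.
spark_df = spark.createDataFrame(dummy_data)

# Naive, serial approach: pull each group back to the driver one at a time.
results = {}
for row in spark_df.select('id').distinct().collect():
    group_pdf = spark_df.filter(spark_df.id == row.id).toPandas()
    results[row.id] = my_custom_function(group_pdf['val'].values)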

How can I parallelise this?

Asked by Dan

1 Answer

This answer is short enough that it should really be a comment, but I don't have enough reputation to comment.

Spark 2.3 introduced pandas vectorized UDFs, which are exactly what you're looking for: they run a custom pandas transformation over a grouped Spark DataFrame in a distributed fashion, with great performance thanks to PyArrow serialization.

See

  • https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
  • http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?pyspark.sql.functions.pandas_udf#pyspark.sql.functions.pandas_udf
  • Using Collect_set after exploding in a groupedBy object in Pyspark

for more information and examples.
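As a hedged sketch (assuming Spark 2.3+ with PyArrow installed, and reusing the Spark DataFrame `spark_df` and the placeholder per-group function `my_custom_function` from the question), a grouped-map pandas UDF could look like this:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Output schema of the per-group result: one row per 'id'.
schema = StructType([
    StructField('id', StringType()),
    StructField('result', DoubleType()),
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def apply_my_function(pdf):
    # pdf is a pandas DataFrame holding all rows for a single 'id';
    # Spark runs this function once per group, in parallel across executors.
    return pd.DataFrame({'id': [pdf['id'].iloc[0]],
                         'result': [my_custom_function(pdf['val'].values)]})

result_df = spark_df.groupby('id').apply(apply_my_function)
result_df.show()

On Spark 3.x the same idea is usually written with spark_df.groupby('id').applyInPandas(func, schema), where func is a plain Python function that takes and returns a pandas DataFrame.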

Answered by Florent F

