
How to use salting technique for joining data frames having skewed data

I am new to Spark and trying to understand how to deal with skewed data. I have created two tables, employee and department. The employee table is skewed: one department accounts for most of its rows.

One solution is to broadcast the department table, and that works fine. But I want to understand how I could use the salting technique in the code below to improve performance.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Hive-enabled session so the spark.employee and spark.department tables resolve
spark = (SparkSession.builder.appName("skewTestSpark")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

df1 = spark.sql("select * from spark.employee")
df2 = spark.sql("select id as dept_id, name as dept_name from spark.department")

# Plain equi-join on the skewed department key
res = df1.join(df2, df1.department == df2.dept_id)
res.write.parquet("hdfs://<host>:<port>/user/result/employee")

Distribution for the above code: [image]

Ankur asked Oct 23 '25 18:10



1 Answer

An employee table, even a skewed one, is unlikely to cause a Spark bottleneck, so the example is somewhat flawed. Salting matters for very large JOINs, not for cases where one side is small enough to fit into the broadcast join category.

Salting: when you salt a SQL join, grouping, or similar operation, the key is changed so that the data is redistributed evenly and the processing time for the operation is similar across partitions.

An excellent example for a JOIN is here: https://dzone.com/articles/why-your-spark-apps-are-slow-or-failing-part-ii-da

Another good read I would recommend is here: https://godatadriven.com/blog/b-efficient-large-spark-optimisation/

I could explain it all here, but the first link explains it well enough. Expect to experiment with the salt to get a better key distribution.
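One cheap way to experiment before touching the cluster is to simulate the key distribution in plain Python. The sketch below is only an approximation: the function `max_partition_share` is hypothetical, and it uses a CRC32 hash as a stand-in for Spark's own partitioner, so the numbers indicate a trend and not what Spark will produce exactly.

```python
import random
import zlib
from collections import Counter


def max_partition_share(keys, num_partitions, num_salts, seed=0):
    """Return the fraction of rows that land in the fullest partition."""
    rng = random.Random(seed)
    counts = Counter()
    for key in keys:
        # Append a random salt to the key, then hash the salted key to
        # pick a partition (CRC32 here is an assumption, not Spark's hash).
        salt = rng.randrange(num_salts)
        salted_key = f"{key}:{salt}".encode()
        counts[zlib.crc32(salted_key) % num_partitions] += 1
    return max(counts.values()) / len(keys)


# Toy data: one hot department holds 90% of the rows.
keys = ["dept_hot"] * 900 + [f"dept_{i}" for i in range(100)]

for num_salts in (1, 4, 16, 64):
    share = max_partition_share(keys, num_partitions=8, num_salts=num_salts)
    print(f"salts={num_salts:>2}  largest partition holds {share:.0%} of rows")
```

With `num_salts=1` the key is effectively unsalted, so the hot department sits in a single partition. Raising the salt count should bring the largest share down towards an even split, and the point where it stops improving is a reasonable starting value to try on the real join.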

thebluephantom answered Oct 26 '25 13:10
