 

How do column data types affect join performance in a Spark or Databricks environment?

I was recently introduced to the dbt tool. One downside of the tool is that you cannot create identity columns (surrogate keys) as sequences; you can only generate a hash of columns to uniquely identify rows. For that reason, I was trying to find out what the impact would be of surrogate keys built as a hash of several columns (string data type), compared to sequence numbers (integer data type), when joining tables in a Spark or Databricks environment. (Fact tables carry surrogate keys from dimension tables as foreign keys, so both columns participating in a join have the same data type.)
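For context, here is a minimal sketch (hypothetical table and column names) of the two kinds of surrogate key being compared: the hash-style key dbt typically generates (roughly the pattern of dbt_utils.generate_surrogate_key(), a 32-character hex string) versus a plain integer key:

SELECT
  -- String surrogate key: md5 hash of the concatenated, stringified business-key columns.
  md5(concat_ws('-',
      coalesce(cast(customer_id AS string), ''),
      coalesce(cast(source_system AS string), ''))) AS customer_sk_hash,
  -- Integer surrogate key: monotonically_increasing_id() yields unique
  -- (though not consecutive) BIGINT values.
  monotonically_increasing_id() AS customer_sk_int,
  *
FROM stg_customer;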

So far, I can only find optimisation techniques for joins such as handling data skew, broadcasting, and reducing shuffling. I haven't seen anything about the impact of column data types on joins.

For example, a documented best practice for BigQuery is to "use INT64 data types in joins to reduce cost and improve comparison performance".

So, to elaborate my question: does the integer data type give better join performance than the string data type when joining tables with Databricks SQL (or Spark SQL), or do column types have almost no impact?

I read a lot of blogs on performance optimisation for Spark and Databricks. None of them mentioned the importance of column data types.

Asked Dec 13 '25 13:12 by Jeyhun


2 Answers

Using numeric data types instead of strings is almost always going to be at least a little faster because they are the "native" data type for processors. Most numeric operations (arithmetic, comparison, etc) are a single, fast CPU instruction. With SIMD vectorization, a single CPU instruction can even process multiple values at a time. The Photon engine in Databricks SQL can make very effective use of both of these things, depending on the query.

In contrast, string operations always require multiple CPU instructions. These can still be very fast though, so in most cases there's nothing wrong with using strings!

But... usually the most important thing is the amount of data and the complexity of the operations on that data. For example, optimisations such as adding filters to the query and reducing the number of columns selected, as well as the overall complexity of the query (number of joins, expressions, etc.), will probably have a bigger impact than the choice of data type. Processing 1GB of strings will probably be faster than processing 5GB of integers, for example.
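To see how much (or little) the key type matters on your own data, here's a hedged sketch (hypothetical table and column names) that writes the same fact-to-dimension join against an integer key and against a hash/string key, so the two variants can be timed or inspected with EXPLAIN:

-- Variant 1: join on BIGINT surrogate keys (cheap, fixed-width comparisons).
SELECT f.order_amount, d.customer_name
FROM fact_orders f
JOIN dim_customer d
  ON f.customer_sk_int = d.customer_sk_int;

-- Variant 2: join on md5/string surrogate keys; each comparison walks up to
-- 32 characters, but the join strategy (broadcast vs. shuffle) is chosen the same way.
SELECT f.order_amount, d.customer_name
FROM fact_orders f
JOIN dim_customer d
  ON f.customer_sk_hash = d.customer_sk_hash;

-- Prefix either query with EXPLAIN FORMATTED to compare the physical plans.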

Answered Dec 16 '25 22:12 by Tim Armstrong


In addition to what Tim has written, I want to add a few more points about the importance of using the right file format. If you're using the Delta Lake format (the default on Databricks), the following features affect join performance:

  • Data skipping - Delta collects min/max statistics for the first N columns (default: 32, configurable) and writes them to the Delta log. When performing a join, these statistics are used to work out which files may contain the data; reducing the number of files to read can greatly improve performance. However, min/max statistics are not very effective for strings and work better for numeric and date/time data types.
  • Z-ordering - when you run the OPTIMIZE ... ZORDER BY column(s) command, Delta co-locates related data in as few files as possible, so joins may read fewer files, which should also improve performance.
  • Bloom filters - these can also significantly improve join performance because they can tell in which files an exact value is stored, which is more effective than min/max statistics over wide value ranges. Bloom filters work with strings as well. (A minimal sketch of all three features follows this list.)
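A hedged sketch of the three features above in Databricks SQL, using hypothetical table and column names (fact_orders, customer_sk); exact defaults and option values may differ between Databricks Runtime versions:

-- 1. Data skipping: min/max statistics are collected for the first N columns;
--    this table property controls N (32 by default).
ALTER TABLE fact_orders
  SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8');

-- 2. Z-ordering: co-locate rows with similar join-key values in fewer files.
OPTIMIZE fact_orders ZORDER BY (customer_sk);

-- 3. Bloom filter index on the join key (works for string keys too).
CREATE BLOOMFILTER INDEX ON TABLE fact_orders
  FOR COLUMNS (customer_sk OPTIONS (fpp = 0.1, numItems = 50000000));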
Answered Dec 16 '25 23:12 by Alex Ott


