What is the most efficient way to select distinct values from a Spark DataFrame?

Of the various approaches, e.g. df.select('column').distinct(), df.groupby('column').count(), etc., which is the most efficient way to extract the distinct values of a column?

Lorenzo Cazador asked Sep 07 '25 00:09



2 Answers

It does not matter, as explained in this excellent reference: https://www.waitingforcode.com/apache-spark-sql/distinct-vs-group-by-key-difference/read

This is because Apache Spark has a logical optimization rule called ReplaceDistinctWithAggregate, which rewrites an expression that uses the DISTINCT keyword into an aggregation.

In the simple case of selecting the unique values of a column, DISTINCT and GROUP BY execute the same way, i.e. as an aggregation.

thebluephantom answered Sep 10 '25 03:09



For larger datasets, groupBy is an efficient method.

Deku07 answered Sep 10 '25 01:09
