What is the most efficient way to select distinct values from a Spark DataFrame?

Question

Of the various ways that I've tried, e.g. df.select('column').distinct(), df.groupby('column').count(), etc., which is the most efficient way to extract the distinct values from a column?

thebluephantom · Accepted Answer

It does not matter, as you can see in this excellent reference: https://www.waitingforcode.com/apache-spark-sql/distinct-vs-group-by-key-difference/read

This is because Apache Spark has a logical optimization rule called ReplaceDistinctWithAggregate that transforms an expression using the distinct keyword into an aggregation.

In the simple case of selecting the unique values of a column, DISTINCT and GROUP BY execute the same way, i.e. as an aggregation.

Deku07 · Answer

For larger datasets, groupby is an efficient method.

Tags:

apache-spark

apache-spark-sql

pyspark

Lorenzo Cazador
