Of the various ways that you've tried, e.g. df.select('column').distinct()
, df.groupby('column').count()
etc., what is the most efficient way to extract distinct values from a column?
It does not matter as you can see in this excellent reference https://www.waitingforcode.com/apache-spark-sql/distinct-vs-group-by-key-difference/read.
This is because Apache Spark has a logical optimization rule called ReplaceDistinctWithAggregate that will transform an expression with distinct keyword by an aggregation.
DISTINCT and GROUP BY in simple contexts of selecting unique values for a column, execute the same way, i.e. as an aggregation.
for larger dataset , groupby is efficient method.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With