How to use countDistinct using a window function in Spark/Scala?

I need a window function that is partitioned by 2 columns and does a distinct count on the 3rd column, returning that as a 4th column. I can do a plain count without any issues, but using a distinct count throws an exception:

org.apache.spark.sql.AnalysisException: Distinct window functions are not supported

Is there any workaround for this?
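For reference, a minimal sketch of the failing approach (the column names i, j, k are placeholders matching the example in the answer below); using countDistinct over a window specification raises the exception above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("i", "j")

// Throws org.apache.spark.sql.AnalysisException:
// "Distinct window functions are not supported"
df.withColumn("cnt", countDistinct("k").over(windowSpec)).show()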

asked by user3407267

1 Answer

Use approx_count_distinct, or collect_set combined with size, over the window to mimic countDistinct functionality.

Example:
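The input DataFrame below can be built like this (a minimal sketch, assuming an existing SparkSession named spark):

import spark.implicits._

val df = Seq(
  (1, "a", "c"),
  (2, "b", "d"),
  (1, "a", "c"),
  (2, "b", "e")
).toDF("i", "j", "k")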

df.show()
//+---+---+---+
//|  i|  j|  k|
//+---+---+---+
//|  1|  a|  c|
//|  2|  b|  d|
//|  1|  a|  c|
//|  2|  b|  e|
//+---+---+---+

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("i","j")

df.withColumn("cnt",size(collect_set("k").over(windowSpec))).show()

//or using approx_count_distinct

df.withColumn("cnt",approx_count_distinct("k").over(windowSpec)).show()

//+---+---+---+---+
//|  i|  j|  k|cnt|
//+---+---+---+---+
//|  2|  b|  d|  2|
//|  2|  b|  e|  2|
//|  1|  a|  c|  1| // "c" is repeated within the (1, a) partition
//|  1|  a|  c|  1|
//+---+---+---+---+
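Note that approx_count_distinct is approximate (it uses HyperLogLog++, with a default relative standard deviation of 0.05), while size(collect_set(...)) is exact but keeps the full set of distinct values per partition in memory. If an exact count is needed and a window is not strictly required, another sketch is to aggregate with countDistinct and join the result back (assuming the same df as above):

df.groupBy("i", "j")
  .agg(countDistinct("k").as("cnt")) // exact distinct count per (i, j) group
  .join(df, Seq("i", "j"))           // attach the count back to every row
  .show()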
answered by notNull

