I need to use a window function that is partitioned by 2 columns and does a distinct count on the 3rd column, with the result as a 4th column. I can do a plain count without any issues, but using distinct count throws an exception:
org.apache.spark.sql.AnalysisException: Distinct window functions are not supported:
Is there any workaround for this?
Use approx_count_distinct, or collect_set combined with size, on the window to mimic countDistinct functionality.
Example:
df.show()
//+---+---+---+
//| i| j| k|
//+---+---+---+
//| 1| a| c|
//| 2| b| d|
//| 1| a| c|
//| 2| b| e|
//+---+---+---+
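For reference, a DataFrame matching the output above can be built like this (a minimal sketch; it assumes a SparkSession named spark and uses the values from the show() output):

import spark.implicits._

val df = Seq((1, "a", "c"), (2, "b", "d"), (1, "a", "c"), (2, "b", "e")).toDF("i", "j", "k")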
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("i","j")
df.withColumn("cnt",size(collect_set("k").over(windowSpec))).show()
//or using approx_count_distinct
df.withColumn("cnt",approx_count_distinct("k").over(windowSpec)).show()
//+---+---+---+---+
//| i| j| k|cnt|
//+---+---+---+---+
//| 2| b| d| 2|
//| 2| b| e| 2|
//| 1| a| c| 1| //c repeats within the (1,a) partition, so the distinct count is 1
//| 1| a| c| 1|
//+---+---+---+---+
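Note that approx_count_distinct is an approximate count (HyperLogLog based), so on large partitions it may not be exact, whereas collect_set with size is exact but keeps every distinct value of the partition in memory. The same collect_set/size trick also works from Spark SQL (a sketch, assuming the DataFrame is registered as a temp view named tbl):

df.createOrReplaceTempView("tbl")
spark.sql("select i, j, k, size(collect_set(k) over (partition by i, j)) as cnt from tbl").show()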