I need to use a window function that is partitioned by 2 columns and does a distinct count on the 3rd column, with the result as a 4th column. I can do a plain count without any issues, but using distinct count throws an exception:
org.apache.spark.sql.AnalysisException: Distinct window functions are not supported:
Is there any workaround for this?
Use approx_count_distinct, or collect_set combined with size, on the window to mimic countDistinct functionality.
Example:
df.show()
//+---+---+---+
//| i| j| k|
//+---+---+---+
//| 1| a| c|
//| 2| b| d|
//| 1| a| c|
//| 2| b| e|
//+---+---+---+
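For reference, a DataFrame matching the output above can be built like this (a minimal sketch; it assumes a SparkSession named spark and uses the values from the show() output):

import spark.implicits._

val df = Seq((1, "a", "c"), (2, "b", "d"), (1, "a", "c"), (2, "b", "e")).toDF("i", "j", "k")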
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("i","j")
df.withColumn("cnt",size(collect_set("k").over(windowSpec))).show()
//or using approx_count_distinct
df.withColumn("cnt",approx_count_distinct("k").over(windowSpec)).show()
//+---+---+---+---+
//| i| j| k|cnt|
//+---+---+---+---+
//| 2| b| d| 2|
//| 2| b| e| 2|
//| 1| a| c| 1| //c repeats within the (1,a) partition, so the distinct count is 1
//| 1| a| c| 1|
//+---+---+---+---+
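Note that approx_count_distinct is an approximate count (HyperLogLog based), so on large partitions it may not be exact, whereas collect_set with size is exact but keeps every distinct value of the partition in memory. The same collect_set/size trick also works from Spark SQL (a sketch, assuming the DataFrame is registered as a temp view named tbl):

df.createOrReplaceTempView("tbl")
spark.sql("select i, j, k, size(collect_set(k) over (partition by i, j)) as cnt from tbl").show()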