
Difference between approxCountDistinct and approx_count_distinct in Spark functions

Can anyone explain the difference between pyspark.sql.functions.approxCountDistinct (I know it is deprecated) and pyspark.sql.functions.approx_count_distinct? I have used both versions in a project and have gotten different values from them.

saiyam asked Sep 18 '25 12:09



1 Answer

As you mentioned, pyspark.sql.functions.approxCountDistinct is deprecated. The reason is most likely just a style concern: they probably wanted everything to be in snake case. As you can see in the source code, pyspark.sql.functions.approxCountDistinct simply calls pyspark.sql.functions.approx_count_distinct and does nothing else except emit a deprecation warning. So regardless of which one you use, the very same code runs in the end.
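The wrapper pattern described above can be sketched in plain Python. This is an illustration, not the PySpark source: both function bodies are stand-ins (the real `approx_count_distinct` builds a Spark Column expression), and the column name `user_id` and the warning text are assumptions.

```python
import warnings


def approx_count_distinct(col, rsd=None):
    # Stand-in body (assumption): return a tuple describing the call
    # instead of building a real Spark Column.
    return ("approx_count_distinct", col, 0.05 if rsd is None else rsd)


def approxCountDistinct(col, rsd=None):
    # Deprecated alias: emit a warning, then delegate with the same arguments.
    warnings.warn(
        "Deprecated, use approx_count_distinct instead.", DeprecationWarning
    )
    return approx_count_distinct(col, rsd)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old = approxCountDistinct("user_id", rsd=0.02)

new = approx_count_distinct("user_id", rsd=0.02)

print(old == new)                   # True
print(caught[0].category.__name__)  # DeprecationWarning
```

With a wrapper like this, the only observable difference between the two names is the warning, provided both are called with the same column and the same `rsd`.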

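Also, still according to the source code, approx_count_distinct is based on the HyperLogLog++ algorithm. I am not very familiar with the algorithm, but it builds a small sketch per partition and then merges the sketches. In the usual HyperLogLog design that merge is an element-wise maximum over registers, so the order in which the executors' results are combined should not by itself change the estimate. What does change it is the rsd argument (the maximum relative standard deviation allowed, 0.05 by default) and, of course, the input data. So if you see different values, I would first check that both calls use the same rsd and run on the same rows, and keep in mind that either function returns an estimate, not the exact distinct count.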
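To make the merging step concrete, here is a toy HyperLogLog in plain Python. It is a sketch under assumptions, not Spark's HyperLogLog++ code: the class `ToyHLL`, the SHA-1 hash, the helper `merge_all`, and the precision values are all made up for illustration, and the HLL++ bias corrections are left out. In this textbook version the merge is an element-wise maximum, so merging in a different order yields the same registers. The estimate moves when the precision changes (in Spark the precision is derived from `rsd`).

```python
import hashlib
import math


class ToyHLL:
    """Textbook HyperLogLog, without the HyperLogLog++ refinements."""

    def __init__(self, p=10):
        self.p = p                     # number of index bits
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Fixed hash function, so repeated runs see the same hashes.
        digest = hashlib.sha1(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        idx = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Element-wise max: commutative and associative.
        merged = ToyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range correction
        return raw


def merge_all(sketches, p=10):
    acc = ToyHLL(p)
    for s in sketches:
        acc = acc.merge(s)
    return acc


values = list(range(20_000))

# One sketch per simulated partition.
partition_sketches = []
for part in (values[i::4] for i in range(4)):
    sketch = ToyHLL()
    for v in part:
        sketch.add(v)
    partition_sketches.append(sketch)

forward = merge_all(partition_sketches)
backward = merge_all(reversed(partition_sketches))
print(forward.registers == backward.registers)  # True
print(round(forward.estimate()))                # an estimate near 20000

# Same data, different precision: the two estimates may differ.
coarse, fine = ToyHLL(p=7), ToyHLL(p=12)
for v in values:
    coarse.add(v)
    fine.add(v)
print(round(coarse.estimate()), round(fine.estimate()))
```

In PySpark the corresponding knob is the `rsd` argument, for example `approx_count_distinct(col, rsd=0.05)`. Two calls with different `rsd` values can legitimately return different estimates for the same column.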

Oli answered Sep 20 '25 02:09
