Difference between approxCountDistinct and approx_count_distinct in Spark functions

Question

Can anyone explain the difference between pyspark.sql.functions.approxCountDistinct (which I know is deprecated) and pyspark.sql.functions.approx_count_distinct? I have used both versions in a project and got different values.

Oli · Accepted Answer

As you mentioned, pyspark.sql.functions.approxCountDistinct is deprecated. The reason is most likely just a style concern: they probably wanted every function name to be in snake case. As you can see in the source code, pyspark.sql.functions.approxCountDistinct simply calls pyspark.sql.functions.approx_count_distinct and does nothing more than emit a deprecation warning. So regardless of which one you use, the very same code runs in the end.
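To illustrate the wrapper pattern described above, here is a minimal sketch in plain Python. It is not the actual PySpark source: both functions are hypothetical stand-ins that operate on ordinary Python lists, and the "approximate" count is simply an exact count so the example stays self-contained. The only point is that the camelCase name warns and then delegates, so both names return the same value.

```python
import warnings


def approx_count_distinct(values, rsd=None):
    """Hypothetical stand-in for the snake_case function.

    It counts exactly (ignoring rsd) purely to keep the sketch runnable.
    """
    return len(set(values))


def approxCountDistinct(values, rsd=None):
    """Deprecated camelCase alias: emit a warning, then delegate unchanged."""
    warnings.warn(
        "Deprecated, use approx_count_distinct instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return approx_count_distinct(values, rsd)
```

Under these assumptions, calling either name with the same arguments runs the same code path; the only observable difference is the DeprecationWarning raised by the old name.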

Also, still according to the source code, approx_count_distinct is based on the HyperLogLog++ algorithm. I am not very familiar with the algorithm, but it is based on repeated set merging. Therefore, the result will most likely depend on the order in which the partial results from the executors are merged. Since that order is not deterministic in Spark, this could explain why you see different results.

Tags:

python

apache-spark

pyspark

saiyam
