Can anyone tell me the difference between pyspark.sql.functions.approxCountDistinct (I know it is deprecated) and pyspark.sql.functions.approx_count_distinct? I have used both versions in a project and got different values from them.
As you mentioned, pyspark.sql.functions.approxCountDistinct is deprecated. The reason is most likely just a style concern: they probably wanted everything to be in snake case. As you can see in the source code, pyspark.sql.functions.approxCountDistinct simply calls pyspark.sql.functions.approx_count_distinct, doing nothing more than emitting a deprecation warning. So regardless of which one you use, the very same code runs in the end.
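For illustration, here is a minimal sketch (assuming a Spark version where the deprecated camelCase alias is still available, with a local session and a toy DataFrame made up for the example) showing that the two names can be used interchangeably:

```python
# Minimal sketch: both names resolve to the same estimator.
# Assumes a Spark version where the deprecated camelCase alias still exists.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy DataFrame with ~1000 distinct values in the "value" column.
df = spark.range(100000).withColumn("value", F.col("id") % 1000)

# Deprecated alias: emits a deprecation warning, then delegates.
old = df.select(F.approxCountDistinct("value").alias("cnt")).first()["cnt"]

# Preferred snake_case name.
new = df.select(F.approx_count_distinct("value").alias("cnt")).first()["cnt"]

print(old, new)  # the same estimator runs underneath in both cases
```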
Also, still according to the source code, approx_count_distinct is based on the HyperLogLog++ algorithm. I am not very familiar with the algorithm, but it works by repeatedly merging partial results (sketches). The final estimate will therefore most likely depend on the order in which the executors' partial results are merged. Since that order is not deterministic in Spark, this could explain why you see different results.