I use Spark SQL 2.2.0.
When executing a query such as:
spark.sql("select COL1, min(STRING_COL2) from TB1 group by COL1").explain()
Spark will use a sort aggregate, since STRING_COL2 is a string column. In most cases sort-based aggregation is much more expensive than hash-based aggregation.
Note that specifying a string column in the GROUP BY clause does not by itself force sort-based aggregation; it is the string argument to the aggregate function that does.
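To make the cost difference concrete, here is a minimal pure-Python sketch (not Spark code) contrasting the two strategies: hash-based aggregation keeps a running minimum per key in a map in one O(n) pass, while sort-based aggregation first sorts all rows (O(n log n)) and then scans runs of equal keys.

```python
# Hash-based aggregation: one pass over the rows, keyed by a hash map.
def hash_min(rows):
    acc = {}
    for key, value in rows:
        if key not in acc or value < acc[key]:
            acc[key] = value
    return acc

# Sort-based aggregation: sort rows by (key, value) first, then the
# first value seen for each key is its minimum.
def sort_min(rows):
    acc = {}
    for key, value in sorted(rows):
        if key not in acc:
            acc[key] = value
    return acc

rows = [("a", "x"), ("b", "m"), ("a", "c"), ("b", "z")]
# Both produce the same result; only the cost differs.
assert hash_min(rows) == sort_min(rows) == {"a": "c", "b": "m"}
```

Spark's actual operators are far more involved (code generation, UnsafeRow buffers, spilling), but the asymptotic picture is the same.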
If you replace min(STRING_COL2) with sort_array(collect_set(STRING_COL2))[0], Spark will use ObjectHashAggregate, which is much better than SortAggregate (two times faster in my case).
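The rewrite is semantically equivalent because the smallest element of the sorted distinct set is the overall minimum. A quick pure-Python stand-in for the Spark expressions illustrates both the equivalence and the extra work involved:

```python
values = ["pear", "apple", "banana", "apple"]

# MIN(col): a single running comparison, constant memory per group.
min_direct = min(values)

# sort_array(collect_set(col))[0]: materialize the distinct set,
# sort it, take the first element -- more memory and more CPU.
min_via_set = sorted(set(values))[0]

assert min_direct == min_via_set == "apple"
```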
However, collecting a set of distinct values, sorting it, and finally taking the first value requires more memory and CPU than just comparing two values (as MIN is supposed to do). In addition, ObjectHashAggregate will fall back to SortAggregate if too many entries are aggregated.
How can I avoid the heavy sort without increasing memory consumption? Why are MIN and MAX of string columns not supported by HashAggregate? When will they be supported?
Thanks.
Possibly too late to answer. Anyway, what I found from the code is that only a limited set of functions uses object hash aggregation; the list can be found in the link below. So functions like min and max still use sort aggregate if the data type is immutable. The reason functions like min and max do not derive from TypedImperativeAggregate, the way functions like percentile do, is that they accept arbitrary expressions, not just column names.
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Expression-TypedImperativeAggregate.html
Hash-based aggregation is the default, but it may fall back to sort-based aggregation when there are too many distinct GROUP BY keys, exceeding the buffer size of the hash-based aggregation.
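The fallback can be sketched in pure Python (a hypothetical simplification, not Spark's implementation): hash-aggregate until the map holds a bounded number of groups, then process the remaining rows with a sort-based pass and merge.

```python
def aggregate_min(rows, max_keys=2):
    """Hash-aggregate until the buffer holds max_keys groups, then
    fall back to sort-based aggregation for the remaining rows."""
    acc, fell_back, pending = {}, False, []
    for key, value in rows:
        if not fell_back:
            if key in acc or len(acc) < max_keys:
                if key not in acc or value < acc[key]:
                    acc[key] = value
                continue
            fell_back = True  # buffer full: stop hashing new keys
        pending.append((key, value))
    # Sort-based pass over whatever did not fit in the hash buffer.
    for key, value in sorted(pending):
        if key not in acc or value < acc[key]:
            acc[key] = value
    return acc

rows = [("a", 3), ("b", 1), ("c", 2), ("a", 0)]
assert aggregate_min(rows) == {"a": 0, "b": 1, "c": 2}
```

In real Spark the hash buffer is sized in bytes rather than key count, and the fallback sorts and spills partial aggregates, but the two-phase structure is the same.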
See this blog.