In SQL, something like
SELECT count(id), sum(if(column1 = 1, 1, 0)) FROM groupedTable
can be formulated to compute both the total record count and a filtered count in a single pass.
How can I do the same with the Spark DataFrame API? That is, without needing to join one of the counts back to the original data frame.
Just use count for both cases:
df.select(count($"id"), count(when($"column1" === 1, true)))
If the column is nullable you should correct for that (for example with coalesce or IS NULL, depending on the desired output).
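A minimal, self-contained sketch of this single-pass approach, assuming a local SparkSession and a made-up DataFrame with nullable id and column1 columns (the column names and sample data are only for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, when}

val spark = SparkSession.builder().master("local[*]").appName("single-pass-counts").getOrCreate()
import spark.implicits._

// Hypothetical sample data: one row has a null id, one has a null column1.
val df = Seq(
  (Some(1), Some(1)),
  (Some(2), Some(0)),
  (None,    None)
).toDF("id", "column1")

df.select(
  count($"id").as("total"),                           // count(id) skips rows where id is null -> 2
  count(lit(1)).as("all_rows"),                       // counts every row regardless of nulls -> 3
  count(when($"column1" === 1, true)).as("matching")  // when(...) without otherwise() yields null for non-matches, which count ignores -> 1
).show()

Both aggregates are evaluated in the same job over a single scan of the data, which is exactly what the SQL sum(if(...)) trick achieves; count(when(...)) works because count only counts non-null values.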