Is there a function similar to describe() for non-numeric columns?
I'd like to gather stats about the 'data completeness' of my table. For example,
data.describe() produces interesting values (count, mean, stddev, min, max) for numeric columns only. Is there anything that works well with strings or other types?
There isn't. The problem is that basic statistics on numerical data are cheap. On categorical data, some of these may require multiple data scans and unbounded memory (linear in the number of records).
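To illustrate the cost difference, here is a minimal plain-Python sketch (not Spark code, and not how Spark implements describe()). The function names `numeric_summary` and `categorical_summary` are made up for this example. The numeric summary keeps only a handful of running values, while the categorical one has to keep a counter entry for every distinct value it sees, which in the worst case is one entry per record. Both assume non-empty input.

```python
import math
from collections import Counter


def numeric_summary(values):
    """Single pass, constant memory: count, mean, stddev, min, max.

    Uses Welford's running update for the variance.
    """
    count, mean, m2 = 0, 0.0, 0.0
    lo, hi = math.inf, -math.inf
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        lo, hi = min(lo, x), max(hi, x)
    # Sample standard deviation; undefined for fewer than two values.
    stddev = math.sqrt(m2 / (count - 1)) if count > 1 else float("nan")
    return {"count": count, "mean": mean, "stddev": stddev, "min": lo, "max": hi}


def categorical_summary(values):
    """Single pass, but memory grows with the number of distinct values."""
    counts = Counter(values)  # one entry per distinct value
    top, freq = counts.most_common(1)[0]
    return {
        "count": sum(counts.values()),
        "unique": len(counts),
        "top": top,
        "freq": freq,
    }
```

For instance, `categorical_summary(["a", "b", "a"])` needs two counter entries, and a column of unique IDs would need one entry per row.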
Some are very cheap, though. For example, counting NULL or empty values; see: Count number of non-NaN entries in each column of Spark dataframe with Pyspark