
Spark Scala Dataframe describe non numeric columns

Is there a function similar to describe() for non-numeric columns?

I'd like to gather stats about the 'data completeness' of my table, for example:

  • total number of records
  • total number of null values
  • total number of special values (e.g. 0s, empty strings, etc...)
  • total number of distinct values
  • other stuff like this...

data.describe() produces interesting values (count, mean, stddev, min, max) for numeric columns only. Is there anything that works well with Strings or other types?

Marsellus Wallace asked Dec 05 '25 10:12



1 Answer

There isn't. The problem is that basic statistics on numerical data are cheap to compute. On categorical data, some of these statistics may require multiple data scans and unbounded memory (linear in the number of records).

Some are very cheap, for example counting NULL or empty values. See: "Count number of non-NaN entries in each column of Spark dataframe with Pyspark".
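As a rough Scala sketch of the cheap cases (not a built-in Spark feature), the aggregation below collects the total row count and, per column, the null count, an approximate distinct count, the empty-string count for string columns and the zero count for numeric columns, all in a single pass. The object name `Completeness`, the helper `completeness`, the output column suffixes and the sample data are assumptions made up for illustration. The distinct count uses `approx_count_distinct` because an exact distinct count is one of the expensive statistics mentioned above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{approx_count_distinct, col, count, lit, sum, when}
import org.apache.spark.sql.types.{NumericType, StringType}

object Completeness {
  // Hypothetical helper: returns a one-row DataFrame of completeness stats.
  def completeness(df: DataFrame): DataFrame = {
    val perColumn = df.schema.fields.toSeq.flatMap { f =>
      val c = col(s"`${f.name}`") // backticks guard against dots in names
      // Count rows where the column is NULL.
      val nulls = sum(when(c.isNull, 1).otherwise(0)).alias(s"${f.name}_nulls")
      // Approximate number of distinct non-null values (bounded memory).
      val distinct = approx_count_distinct(c).alias(s"${f.name}_approx_distinct")
      // "Special" values depend on the column type.
      val special = f.dataType match {
        case StringType =>
          Seq(sum(when(c === "", 1).otherwise(0)).alias(s"${f.name}_empty"))
        case _: NumericType =>
          Seq(sum(when(c === 0, 1).otherwise(0)).alias(s"${f.name}_zeros"))
        case _ => Seq.empty
      }
      Seq(nulls, distinct) ++ special
    }
    // All aggregates are evaluated together in one scan of the data.
    df.agg(count(lit(1)).alias("total"), perColumn: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("completeness-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: one empty string, one null name, one null number.
    val data = Seq[(String, Option[Int])](
      ("a", Some(1)), ("", None), (null, Some(0))
    ).toDF("name", "n")

    completeness(data).show(truncate = false)
    spark.stop()
  }
}
```

The result is a single wide row. For tables with many columns it may be easier to read after transposing it on the driver.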

user7735143 answered Dec 07 '25 06:12
