
Why does summary(a_categorical_var) not show "NA" counts?

Tags:

r

summary

I want to examine my dataset, flights, using the summary() function.

summary(flights["tailnum"])

Results:

   tailnum         
 Length:336776     
 Class :character  
 Mode  :character  

In particular, it does not show that the character variable tailnum has any NAs.

However, when I use sum(is.na(flights$tailnum)), it shows it has NAs.

[1] 2512

What is the best function to examine a categorical variable - show its levels, missing values, total number of rows and frequencies for each level?

Athenaj asked Jan 25 '26 11:01


1 Answer

Apparently the summary() method for character variables doesn't report NAs. (This does seem a bit inconsistent, and might be worth reporting or discussing on an R mailing list.)

If you convert the variable to a factor and apply summary() to it specifically, you'll get a table of the counts of the 98 most frequent levels, followed by an "(Other)" category and the number of NAs (labelled "NA's").

summary(factor(flights$tailnum))
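
To see the difference without loading flights, here is a minimal sketch on a toy vector. The values in x are made up for illustration and are not taken from the dataset.

```r
# Toy character vector with two missing values (hypothetical data)
x <- c("N101", "N202", NA, "N101", NA, "N303")

# Character method: reports only Length / Class / Mode, with no NA count
print(summary(x))

# Factor method: per-level counts plus a trailing "NA's" entry
s <- summary(factor(x))
print(s)
```

With so few levels nothing is lumped into "(Other)"; that only happens once the number of levels exceeds the maxsum argument of summary().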

If you really want a full tabulation:

tt <- table(flights$tailnum, useNA = "ifany")
print(tt)

Here length(tt) is 4044, telling you that there are 4043 distinct non-NA values (plus one entry for the NA values), which is too many to read through. However, head(table(tt)) and tail(table(tt)) tell you that there are hundreds of values that occur only a few times, and a few values that occur hundreds (or thousands) of times.
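
The same idea on a small scale, as a sketch with made-up values: table(tt) counts how many distinct values share each frequency, and sorting tt puts the most common values first.

```r
# Hypothetical data: "a" appears 3 times, "b" twice, "c" once, NA twice
x <- c("a", "b", NA, "a", "c", "a", NA, "b")

tt <- table(x, useNA = "ifany")   # one entry per distinct value, including NA
print(length(tt))                 # number of distinct values (NA counted once)

# "Frequency of frequencies": how many values occur once, twice, ...
freq_of_freq <- table(tt)
print(freq_of_freq)

# Most common values first
sorted <- sort(tt, decreasing = TRUE)
print(head(sorted, 3))
```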

If you're using tidyverse and want to convert all character variables to factors:

flights %>% mutate(across(where(is.character), factor))
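
If you want the row count, NA count, number of distinct values and the top frequencies in one call, the pieces above can be wrapped in a small function. This is only a sketch in base R: describe_categorical() and its top argument are hypothetical names introduced here, not an existing function.

```r
# describe_categorical() is a hypothetical helper, not part of R or any package
describe_categorical <- function(x, top = 10) {
  # Counts of non-missing values, most frequent first
  counts <- sort(table(x, useNA = "no"), decreasing = TRUE)
  list(
    n_rows     = length(x),        # total number of rows
    n_missing  = sum(is.na(x)),    # number of NAs
    n_levels   = length(counts),   # distinct non-NA values
    top_counts = head(counts, top) # frequencies of the most common values
  )
}

# Toy example (made-up values)
x <- c("N101", "N202", NA, "N101", NA, "N303")
info <- describe_categorical(x)
str(info)
```

On the real data this would be called as describe_categorical(flights$tailnum).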

Ben Bolker answered Jan 28 '26 00:01


