I want to examine my dataset, flights, using the summary() function.
summary(flights["tailnum"])
Results:
tailnum
Length:336776
Class :character
Mode :character
In particular, it does not show that the character variable tailnum has any NAs.
However, when I use sum(is.na(flights$tailnum)), it shows it has NAs.
[1] 2512
What is the best function for examining a categorical variable, one that shows its levels, missing values, total number of rows, and the frequency of each level?
Apparently the summary() method for character variables doesn't report NAs. (This seems a bit inconsistent; it might be worth reporting or discussing on one of the R mailing lists ...)
If you convert the variable to a factor and apply summary() to it specifically, you'll get a table of counts for the 98 most frequent levels, followed by an "(Other)" category and the number of NAs.
summary(factor(flights$tailnum))
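To see this behaviour on a small scale, here is a minimal sketch using a made-up vector (`x` is hypothetical and not part of flights). It assumes the `maxsum` argument of the factor method of summary(), which caps how many entries are shown before the less frequent levels are lumped into "(Other)":

```r
# Toy character vector with missing values (hypothetical data)
x <- c("a", "a", "a", "b", "b", "c", "d", NA, NA)

# summary() on a factor counts each level and appends an "NA's" entry
s_fac <- summary(factor(x))

# maxsum caps the total number of entries (including "(Other)" and "NA's");
# the less frequent levels are summed into "(Other)"
s_top <- summary(factor(x), maxsum = 3)

print(s_fac)
print(s_top)
```

With the default `maxsum = 100` and NAs present, that leaves room for 98 levels plus "(Other)" and "NA's", which is where the 98 comes from.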
If you really want a full tabulation:
tt <- table(flights$tailnum, useNA = "ifany")
print(tt)
The printed table is long, though: length(tt) is 4044, which tells you that there are 4043 distinct non-NA values (plus the NA category). head(table(tt)) and tail(table(tt)) show that hundreds of values occur only a few times, while a few values occur hundreds (or thousands) of times.
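The "table of a table" idea may be easier to follow on toy data. This is only a sketch with a hypothetical vector `x`; the names `n_categories` and `freq_of_freq` are made up for illustration:

```r
# Toy data (hypothetical): one common value, several rare ones, two NAs
x <- c(rep("N1", 5), "N2", "N3", "N4", "N5", "N5", NA, NA)

# Count occurrences of each value, keeping NA as its own category
tt <- table(x, useNA = "ifany")

# Number of categories, including the NA category
n_categories <- length(tt)

# Tabulate the counts themselves: how many values occur once, twice, ...
freq_of_freq <- table(tt)
print(freq_of_freq)
```

Note that the NA category is counted like any other value here, so it also contributes to `freq_of_freq`.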
If you're using tidyverse and want to convert all character variables to factors:
library(dplyr)  # for %>%, mutate() and across()
flights %>% mutate(across(where(is.character), factor))
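If you'd rather stay in base R, something along these lines should do the same conversion. It is a sketch on a small made-up data frame (`df` and its columns are hypothetical stand-ins for flights):

```r
# Toy data frame (hypothetical) with character and numeric columns
df <- data.frame(
  tailnum = c("N1", "N2", NA, "N1"),
  carrier = c("UA", "AA", "UA", "UA"),
  distance = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving other columns alone
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

# summary() on the data frame now reports level counts (and NA's) per factor
print(summary(df))
```

Once the columns are factors, a single summary() call on the whole data frame gives per-level counts and NA counts for each of them.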