I am trying to calculate descriptive statistics for the birthweight data set (birthwt) from the MASS package in R. However, I'm only interested in a few variables: age, ftv, ptl and lwt.
This is the code I have so far:
library(MASS)
library(dplyr)
data("birthwt")
grouped <- group_by(birthwt, age, ftv, ptl, lwt)
summarise(grouped, 
          mean = mean(bwt),
          median = median(bwt),
          SD = sd(bwt))
It gives me a pretty-printed table, but only a few of the SD values are filled in and the rest are NA. I just can't work out why, or how to fix it!
Method 1: Using summarise_all(). The summarise_all() function in dplyr applies the given function to every non-grouping column of the data frame and returns one summary column per input column.
The summarise() function in R. As its name implies, summarise() reduces each group of a data frame to a single row of summary values. These summaries are often calculated after first grouping the observations by a factor or categorical variable.
The function n() returns the number of observations in the current group.
I landed here for a different reason, and in my case too the answer comes from the docs:
# BEWARE: reusing variables may lead to unexpected results
mtcars %>%
    group_by(cyl) %>%
    summarise(disp = mean(disp), sd = sd(disp))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#>     cyl  disp    sd
#>   <dbl> <dbl> <dbl>
#> 1     4  105.    NA
#> 2     6  183.    NA
#> 3     8  353.    NA
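summarise() evaluates its arguments in order, so once disp = mean(disp) has run, disp refers to the single mean value, and the sd of a single value is NA. So, in case someone has the same problem as me: instead of reusing a variable name, create new ones: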
mtcars %>%
    group_by(cyl) %>%
    summarise(
        disp_mean = mean(disp),
        disp_sd = sd(disp)
    )
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#>     cyl disp_mean disp_sd
#>   <dbl>     <dbl>   <dbl>
#> 1     4      105.    26.9
#> 2     6      183.    41.6
#> 3     8      353.    67.8
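To see why reusing the name produces NA, here is a minimal base-R sketch of what happens inside a single group (cyl == 4). It only imitates the order of evaluation and is not how dplyr is implemented internally; the variable name disp is reused on purpose.

```r
# Imitate summarise(disp = mean(disp), sd = sd(disp)) for one group, in base R
disp <- mtcars$disp[mtcars$cyl == 4]  # the original column values for this group
disp <- mean(disp)                     # 'disp' now holds one number, not the column
n_values <- length(disp)               # so only a single value is left
sd_after <- sd(disp)                   # sd() of a length-one vector is NA
print(n_values)
print(sd_after)
```

Swapping the order, so that sd(disp) is computed before disp is overwritten, would also avoid the NA, but distinct names are less fragile.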
Some of the groups contain only one row.
grouped %>% 
     summarise(n = n())
# A tibble: 179 x 5
# Groups: age, ftv, ptl [?]
#     age   ftv   ptl   lwt     n
#   <int> <int> <int> <int> <int>
# 1    14     0     0   135     1
# 2    14     0     1   101     1
# 3    14     2     0   100     1
# 4    15     0     0    98     1
# 5    15     0     0   110     1
# 6    15     0     0   115     1
# 7    16     0     0   110     1
# 8    16     0     0   112     1
# 9    16     0     0   135     2
#10    16     1     0    95     1
According to ?sd:
The standard deviation of a length-one vector is NA.
This is why sd is NA for every group that contains only one element.
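As a rough sketch of two possible ways forward, assuming dplyr 1.0.0 or later (for across()): either summarise the four variables themselves without grouping, or group by a single coarser variable and report the group size alongside the statistics. The object names var_stats and by_ftv and the choice of ftv as the grouping variable are only illustrative, not something the question specifies.

```r
library(MASS)   # provides the birthwt data set
library(dplyr)  # across() needs dplyr >= 1.0.0

data("birthwt")

# A length-one vector has no spread to estimate, so sd() returns NA
single_sd <- sd(5)

# Option 1: descriptive statistics of the four variables, with no grouping.
# The result is one row with columns such as age_mean, age_median, age_sd.
var_stats <- birthwt %>%
  summarise(across(c(age, ftv, ptl, lwt),
                   list(mean = mean, median = median, sd = sd)))

# Option 2: keep a grouping, but a coarser one, and report the group size
by_ftv <- birthwt %>%
  group_by(ftv) %>%
  summarise(n = n(),
            bwt_mean = mean(bwt),
            bwt_median = median(bwt),
            bwt_sd = sd(bwt))   # still NA for any group with n == 1

print(var_stats)
print(by_ftv)
```

Including n in the output makes it easy to see which NA values simply reflect groups with a single observation.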