 

Data framing summary statistics in R

Tags:

dataframe

r

I need to create an XLSX file containing the summary statistics (as returned by the summary() function), but I have not found a reliable way to put each statistic (mean, median, NA's, etc.) in its own row for each of the original variables. Since my dataset has more than 200 variables, I need a systematic approach instead of manually deleting words in my XLSX output.

After some research, I found some partial solutions, such as:

x1 <- as.data.frame(do.call(cbind, lapply(df, summary, is.numeric)))
x2 <- data.frame(unclass(summary(df1)), check.names = FALSE, stringsAsFactors = FALSE)
x3 <- as.data.frame(apply(df,2,summary))
x4 <- data.frame(df1=matrix(df1),row.names=names(df1))

And what I need is something like this:

          y1       y2       y3       y4       y5
Min.       1.00     1.00    23.00    50.00     6.00
1st Qu.   31.75     3.75    30.50    57.25    11.75
Median    43.00     7.00    56.00    76.00    15.00
Mean      51.75     6.10    55.55    72.05    14.35
3rd Qu.   80.25     8.25    73.50    83.75    17.00
Max.      99.00    10.00   100.00    95.00    20.00

If anyone wants to try it, this data frame gives the same errors as my large one:

x1 <- rpois(20,5)
x2 <- rexp(20,2)
x3 <- rexp(20,5); x3[1:10] <- NA_real_
x4 <- runif(20,5,10)
x5 <- runif(20,5,12)
df1 <- data.frame(x1,x2,x3,x4,x5)

Thanks in advance!

Igor Mendonça asked Sep 15 '25 13:09



1 Answer

Consider an example data frame with columns y1, y2, ..., yn to summarise:

library(tidyr)
library(dplyr)

data.frame(y1 = rnorm(100),
           y2 = runif(100)  ## , ... yn
           ) %>%
    ## stack all y columns into variable/value pairs
    pivot_longer(starts_with('y'),
                 names_to = 'variable',
                 values_to = 'value'
                 ) %>%
    group_by(variable) %>%
    summarise(Min = min(value, na.rm = TRUE),
              Median = median(value, na.rm = TRUE)  ## , further statistics ad libitum
              ) %>%
    ## reshape to one row per statistic, one column per variable
    pivot_longer(-variable) %>%
    pivot_wider(names_from = variable)

More generally, the {broom} package offers convenient tidying of summaries into tibbles:

library(broom)
library(dplyr)    ## provides the %>% pipe

summary(1:10) %>% tidy()
lm(displ ~ cyl, data = ggplot2::mpg) %>% tidy()    ## mpg ships with ggplot2

or, if you want one row per statistic (as in your example) instead of the single wide row that tidy() returns, reshape the result to long format:

library(broom)
library(tidyr)

summary(1:10) %>%
    tidy %>%
    pivot_longer(everything(),
                 names_to = 'stat',
                 values_to = 'value'
                 )
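
For the df1 from the question, a base R variant may be easier to scale to 200+ columns. The following is only a sketch under some assumptions: summary_stats is a hypothetical helper name (not part of any package), the set.seed() call is added purely for reproducibility, and the NA count is reported for every column so that all columns yield vectors of the same length (summary() itself adds the NA's entry only for columns that contain NAs, which is a likely reason the columns are hard to bind together).

```r
# Example data from the question (seed added here only for reproducibility)
set.seed(1)
x1 <- rpois(20, 5)
x2 <- rexp(20, 2)
x3 <- rexp(20, 5); x3[1:10] <- NA_real_
x4 <- runif(20, 5, 10)
x5 <- runif(20, 5, 12)
df1 <- data.frame(x1, x2, x3, x4, x5)

# Hypothetical helper: always returns the same seven named statistics,
# so columns with and without NAs give vectors of equal length
summary_stats <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c("Min."    = min(x, na.rm = TRUE),
    "1st Qu." = q[1],
    "Median"  = q[2],
    "Mean"    = mean(x, na.rm = TRUE),
    "3rd Qu." = q[3],
    "Max."    = max(x, na.rm = TRUE),
    "NA's"    = sum(is.na(x)))
}

# Keep numeric columns only, then build a statistics-by-variable matrix
num_cols <- vapply(df1, is.numeric, logical(1))
stats <- vapply(df1[num_cols], summary_stats, numeric(7))

# Move the statistic names into a regular column, because most XLSX
# writers do not export row names
stats_df <- data.frame(stat = rownames(stats), stats,
                       row.names = NULL, check.names = FALSE)
print(stats_df)
```

stats_df can then be passed to an XLSX writer, for example writexl::write_xlsx(stats_df, "summary.xlsx"), assuming the writexl package is installed; the file name here is just a placeholder.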
