Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using pivot_longer for multiple untidy columns

Tags:

r

tidyr

I have these (untidy) data containing each patient's drug regimen phase (ip or cp), drug name (coded as a number), and dose information for multiple drugs:

df_have

#   id ip_drug1 ip_dose1 ip_drug2 ip_dose2 cp_drug1 cp_dose1 cp_drug2 cp_dose2
# 1 A1        1      300        3      100        6      500        7      100
# 2 A2        1      300        2      200       11      300       NA       NA
# 3 A3        1      500       NA       NA        9      100        5     1500

I would like to make these data tidy and in long format:

df_want

#    id phase drug dose
# 1  A1    ip    1  300
# 2  A1    ip    3  100
# 3  A1    cp    6  500
# 4  A1    cp    7  100
# 5  A2    ip    1  300
# 6  A2    ip    2  200
# 7  A2    cp   11  300
# 8  A2    cp   NA   NA
# 9  A3    ip    1  500
# 10 A3    ip   NA   NA
# 11 A3    cp    9  100
# 12 A3    cp    5 1500

I was able to get the desired data frame by a combination of tidyr::pivot_longer, dplyr::mutate, and tidyr::pivot_wider (and dplyr::select):

library(tidyr)
library(dplyr)

df_have %>% 
  pivot_longer(cols = -id, 
               names_to = c("phase", "type"),
               names_pattern = "(cp|ip)_(drug|dose)") %>%
  mutate(temp = row_number(), 
         .by = c(id, phase, type)) %>%
  pivot_wider(names_from = type, 
              values_from = value) %>%
  select(-temp)

However the above multi-step code is very slow on my very large real data. I would like to accomplish this transformation faster within tidyr/dplyr, ideally in a single pivot_wider step. Is this possible?


Reproducible df_have and df_want data frames:

# have
df_have <- data.frame(id = paste0("A", 1:3), 
                 ip_drug1 = 1,
                 ip_dose1 = c(300, 300, 500),
                 ip_drug2 = c(3, 2, NA),
                 ip_dose2 = c(100, 200, NA),
                 cp_drug1 = c(6, 11, 9),
                 cp_dose1 = c(500, 300, 100),
                 cp_drug2 = c(7, NA, 5),
                 cp_dose2 = c(100, NA, 1500))

# want
df_want <- data.frame(id = rep(paste0("A", 1:3), each = 4),
                      phase = rep(rep(c("ip", "cp"), each = 2), times = 3),
                      drug = c(1, 3, 6, 7, 1, 2, 11, NA, 1, NA, 9, 5),
                      dose = c(300, 100, 500, 100, 300, 200, 300, NA, 500, NA, 100, 1500))
like image 337
jpsmith Avatar asked Oct 31 '25 14:10

jpsmith


2 Answers

It is possible in one pivot_longer step. This should be around 4.5x faster:

library(tidyr)

df_have |>
  pivot_longer(-id,
               names_pattern = "(.*?)_(.*?)\\d",
               names_to = c("phase", ".value")) 

I think the key is the use of .value in the names_to argument. From ?pivot_longer:

".value" indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding values_to entirely.

Benchmark

The solution by @ThomasIsCoding is even faster (~ 12x)

one_pivot <- function() {
  df_have |>
    pivot_longer(-id,
                 names_pattern = "(.*?)_(.*?)\\d",
                 names_to = c("phase", ".value"))
}

current <- function() {
  df_have %>% 
    pivot_longer(cols = -id, 
                 names_to = c("phase", "type"),
                 names_pattern = "(cp|ip)_(drug|dose)") %>%
    mutate(temp = row_number(), 
           .by = c(id, phase, type)) %>%
    pivot_wider(names_from = type, 
                values_from = value) %>%
    select(-temp)
}

base <- function() {
  out <- reshape(
    setNames(
      df_have,
      gsub("(\\D+)_(\\D+)", "\\2_\\1", names(df_have))
    ),
    direction = "long",
    idvar = "id",
    varying = -1,
    sep = "_",
    timevar = "phase"
  )
  
  transform(
    `row.names<-`(out[order(out$id), ], NULL),
    phase = sub("\\d+$", "", phase)
  )
}

bench::mark(
  current(),
  one_pivot(),
  base(),
  relative = TRUE,
  check = FALSE
)
  
  expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory               time       gc      
  <bch:expr>  <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>   <bch:tm> <list> <list>               <list>     <list>  
1 current()   11.4   11.4       1          Inf     1.04    27     2      463ms <NULL> <Rprofmem [141 × 3]> <bench_tm> <tibble>
2 one_pivot()  2.54   2.67      4.43       Inf     1      124     2      480ms <NULL> <Rprofmem [22 × 3]>  <bench_tm> <tibble>
3 base()       1      1        11.7        NaN     1.52   322     3      473ms <NULL> <Rprofmem [0 × 3]>   <bench_tm> <tibble>

Output

   id    phase  drug  dose
   <chr> <chr> <dbl> <dbl>
 1 A1    ip        1   300
 2 A1    ip        3   100
 3 A1    cp        6   500
 4 A1    cp        7   100
 5 A2    ip        1   300
 6 A2    ip        2   200
 7 A2    cp       11   300
 8 A2    cp       NA    NA
 9 A3    ip        1   500
10 A3    ip       NA    NA
11 A3    cp        9   100
12 A3    cp        5  1500
like image 100
LMc Avatar answered Nov 03 '25 07:11

LMc


Here is a base R solution with one single reshape

out <- reshape(
    setNames(
        df_have,
        gsub("(\\D+)_(\\D+)", "\\2_\\1", names(df_have))
    ),
    direction = "long",
    idvar = "id",
    varying = -1,
    sep = "_",
    timevar = "phase"
)

df_want <- transform(
    `row.names<-`(out[order(out$id), ], NULL),
    phase = sub("\\d+$", "", phase)
)

which gives

> df_want
   id phase drug dose
1  A1    ip    1  300
2  A1    ip    3  100
3  A1    cp    6  500
4  A1    cp    7  100
5  A2    ip    1  300
6  A2    ip    2  200
7  A2    cp   11  300
8  A2    cp   NA   NA
9  A3    ip    1  500
10 A3    ip   NA   NA
11 A3    cp    9  100
12 A3    cp    5 1500
like image 44
ThomasIsCoding Avatar answered Nov 03 '25 05:11

ThomasIsCoding



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!