I have these (untidy) data containing each patient's drug regimen phase (ip or cp), drug name (coded as a number), and dose information for multiple drugs:
df_have
# id ip_drug1 ip_dose1 ip_drug2 ip_dose2 cp_drug1 cp_dose1 cp_drug2 cp_dose2
# 1 A1 1 300 3 100 6 500 7 100
# 2 A2 1 300 2 200 11 300 NA NA
# 3 A3 1 500 NA NA 9 100 5 1500
I would like to make these data tidy and in long format:
df_want
# id phase drug dose
# 1 A1 ip 1 300
# 2 A1 ip 3 100
# 3 A1 cp 6 500
# 4 A1 cp 7 100
# 5 A2 ip 1 300
# 6 A2 ip 2 200
# 7 A2 cp 11 300
# 8 A2 cp NA NA
# 9 A3 ip 1 500
# 10 A3 ip NA NA
# 11 A3 cp 9 100
# 12 A3 cp 5 1500
I was able to get the desired data frame by a combination of tidyr::pivot_longer, dplyr::mutate, and tidyr::pivot_wider (and dplyr::select):
library(tidyr)
library(dplyr)
df_have %>%
pivot_longer(cols = -id,
names_to = c("phase", "type"),
names_pattern = "(cp|ip)_(drug|dose)") %>%
mutate(temp = row_number(),
.by = c(id, phase, type)) %>%
pivot_wider(names_from = type,
values_from = value) %>%
select(-temp)
However the above multi-step code is very slow on my very large real data. I would like to accomplish this transformation faster within tidyr/dplyr, ideally in a single pivot_wider step. Is this possible?
Reproducible df_have and df_want data frames:
# have
df_have <- data.frame(id = paste0("A", 1:3),
ip_drug1 = 1,
ip_dose1 = c(300, 300, 500),
ip_drug2 = c(3, 2, NA),
ip_dose2 = c(100, 200, NA),
cp_drug1 = c(6, 11, 9),
cp_dose1 = c(500, 300, 100),
cp_drug2 = c(7, NA, 5),
cp_dose2 = c(100, NA, 1500))
# want
df_want <- data.frame(id = rep(paste0("A", 1:3), each = 4),
phase = rep(rep(c("ip", "cp"), each = 2), times = 3),
drug = c(1, 3, 6, 7, 1, 2, 11, NA, 1, NA, 9, 5),
dose = c(300, 100, 500, 100, 300, 200, 300, NA, 500, NA, 100, 1500))
It is possible in one pivot_longer step. This should be around 4.5x faster:
library(tidyr)
df_have |>
pivot_longer(-id,
names_pattern = "(.*?)_(.*?)\\d",
names_to = c("phase", ".value"))
I think the key is the use of .value in the names_to argument. From ?pivot_longer:
".value" indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding
values_toentirely.
Benchmark
The solution by @ThomasIsCoding is even faster (~ 12x)
one_pivot <- function() {
df_have |>
pivot_longer(-id,
names_pattern = "(.*?)_(.*?)\\d",
names_to = c("phase", ".value"))
}
current <- function() {
df_have %>%
pivot_longer(cols = -id,
names_to = c("phase", "type"),
names_pattern = "(cp|ip)_(drug|dose)") %>%
mutate(temp = row_number(),
.by = c(id, phase, type)) %>%
pivot_wider(names_from = type,
values_from = value) %>%
select(-temp)
}
base <- function() {
out <- reshape(
setNames(
df_have,
gsub("(\\D+)_(\\D+)", "\\2_\\1", names(df_have))
),
direction = "long",
idvar = "id",
varying = -1,
sep = "_",
timevar = "phase"
)
transform(
`row.names<-`(out[order(out$id), ], NULL),
phase = sub("\\d+$", "", phase)
)
}
bench::mark(
current(),
one_pivot(),
base(),
relative = TRUE,
check = FALSE
)
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 current() 11.4 11.4 1 Inf 1.04 27 2 463ms <NULL> <Rprofmem [141 × 3]> <bench_tm> <tibble>
2 one_pivot() 2.54 2.67 4.43 Inf 1 124 2 480ms <NULL> <Rprofmem [22 × 3]> <bench_tm> <tibble>
3 base() 1 1 11.7 NaN 1.52 322 3 473ms <NULL> <Rprofmem [0 × 3]> <bench_tm> <tibble>
Output
id phase drug dose
<chr> <chr> <dbl> <dbl>
1 A1 ip 1 300
2 A1 ip 3 100
3 A1 cp 6 500
4 A1 cp 7 100
5 A2 ip 1 300
6 A2 ip 2 200
7 A2 cp 11 300
8 A2 cp NA NA
9 A3 ip 1 500
10 A3 ip NA NA
11 A3 cp 9 100
12 A3 cp 5 1500
Here is a base R solution with one single reshape
out <- reshape(
setNames(
df_have,
gsub("(\\D+)_(\\D+)", "\\2_\\1", names(df_have))
),
direction = "long",
idvar = "id",
varying = -1,
sep = "_",
timevar = "phase"
)
df_want <- transform(
`row.names<-`(out[order(out$id), ], NULL),
phase = sub("\\d+$", "", phase)
)
which gives
> df_want
id phase drug dose
1 A1 ip 1 300
2 A1 ip 3 100
3 A1 cp 6 500
4 A1 cp 7 100
5 A2 ip 1 300
6 A2 ip 2 200
7 A2 cp 11 300
8 A2 cp NA NA
9 A3 ip 1 500
10 A3 ip NA NA
11 A3 cp 9 100
12 A3 cp 5 1500
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With