I'm working on a dataframe consisting of 528 columns and 2,643,246 rows. Eight of these are character variables; the rest are integers. In total this adds up to 11.35 GiB of data, with 164 GiB of RAM available.
I now want to run pivot_longer on this dataframe, producing one row for each column plus two ID variables (year and institution). There are 671,370 institutions over 76 years.
At the moment the data are structured like this:
| Institution | Year | X | Y | Z |
|---|---|---|---|---|
| A | 1 | 2 | 1 | 3 |
| A | 2 | 3 | 4 | 4 |
| B | 1 | 3 | 4 | 2 |
| B | 2 | 5 | 3 | 2 |
Where I would like to change it so the structure becomes:
| Institution | Year | G | N |
|---|---|---|---|
| A | 1 | X | 2 |
| A | 1 | Y | 1 |
| A | 1 | Z | 3 |
| A | 2 | X | 3 |
| A | 2 | Y | 1 |
| A | 2 | Z | 4 |
| B | 1 | X | 3 |
| B | 1 | Y | 4 |
| B | 1 | Z | 2 |
| B | 2 | X | 5 |
| B | 2 | Y | 3 |
| B | 2 | Z | 2 |
To achieve this I attempted the following code:
library(tidyverse)

Df <- Df %>%
  pivot_longer(17:527,
               names_to = "G",
               values_to = "N")
When I run this on a small sample of the data I get the expected result; however, when I attempt the same on the whole dataset I quickly run out of memory. Starting from an object using 11 GiB, memory use quickly climbed above 150 GiB before R returned a "cannot allocate vector of size x Gb" error.
Since I haven't added any data, I can't quite see where the extra memory usage comes from. What causes this increase, and is there a more memory-efficient way to do this reshape? Thanks in advance for any help!
For data this size, reshape2 or data.table's melt will likely be more memory-efficient than tidyr's pivot_longer. On a smaller 800 MB sample, reshape2::melt requires around 2.6x as much memory as the original data and data.table::melt around 3.6x, while the tidyr::pivot_longer approach needs about 12x as much memory and is about 20x slower.
EDIT: Reviewing the warnings, I realized that my earlier draft was actually calling reshape2 under the hood when I used data.table::melt, because I was feeding it a tibble. I have added an explicit data.table::melt solution that first converts the data to a data.table. For this data, reshape2::melt appears to be faster and more memory-efficient.
library(tidyverse)

n1 <- as.integer(2e8 / 26)
set.seed(42)
data_long <- data.frame(Institution = 1:n1,
                        G = rep(letters, each = n1),
                        Z = rpois(26 * n1, 10))
data_wide <- data_long %>% pivot_wider(names_from = G, values_from = Z)

bench::mark(
  tidyr::pivot_longer(data_wide, -1),
  reshape2::melt(data_wide, id.vars = 1),
  data.table::melt(data.table::as.data.table(data_wide), id.vars = 1),
  check = FALSE,
  iterations = 10
)
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 tidyr::pivot_longer(data_wide, -1) 26.77s 37.38s 0.0269 10.52GB 0.0538 10 20 6.2m
2 reshape2::melt(data_wide, id.vars = 1) 1.25s 1.73s 0.519 2.23GB 0.156 10 3 19.3s
3 data.table::melt(data.table::as.data.table(data_wide), id.vars = 1) 1.82s 2.41s 0.332 3.01GB 0.232 10 7 30.1s
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
I have no way to test this on your data, but here is one idea.
The idea is to do the wide-to-long transformation for one chunk of rows at a time and store each result in a list. At the end, combine the list into the final data frame. Hopefully this reduces peak memory usage.
If that doesn't work, see whether melt from data.table can do the conversion more efficiently.
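As a sketch of that melt approach, using a small toy data frame shaped like the example tables above (on the real data you would pass measure.vars = 17:527 instead of naming the id columns; variable.name and value.name correspond to names_to and values_to in pivot_longer):

```r
library(data.table)

# Toy stand-in for Df, mirroring the question's example table
Df <- data.table(Institution = c("A", "A", "B", "B"),
                 Year = c(1, 2, 1, 2),
                 X = c(2, 3, 3, 5),
                 Y = c(1, 4, 4, 3),
                 Z = c(3, 4, 2, 2))

# Every column not listed in id.vars is melted into G/N pairs.
# On the real data: melt(Df, measure.vars = 17:527, ...)
Df_long <- melt(Df,
                id.vars = c("Institution", "Year"),
                variable.name = "G",
                value.name = "N")
```

For the full dataset, converting with data.table::setDT(Df) instead of as.data.table(Df) avoids holding a second copy of the wide data in memory during the conversion.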
One other idea that could help: subset Df by dropping columns 1 to 16 before the wide-to-long transformation, keeping only the ID columns. You can join columns 1 to 16 back onto the converted data frame afterwards.
library(tidyverse)

Df_list <- list()
institutions <- unique(Df$Institution)

# Pivot each institution's rows separately, then combine at the end
for (i in institutions) {
  Df_list[[i]] <- Df %>%
    filter(Institution == i) %>%
    pivot_longer(17:527, names_to = "G", values_to = "N")
}

Df <- bind_rows(Df_list)
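The column-subsetting idea mentioned above might look like the following sketch. It uses a toy data frame with a single extra descriptive column (Region, a made-up name standing in for columns 1 to 16 of the real data), assuming Institution and Year together uniquely identify a row:

```r
library(dplyr)
library(tidyr)

# Toy stand-in: Institution/Year are the keys, Region stands in for
# the descriptive columns 1:16, and X/Y/Z are the value columns
Df <- tibble(Institution = c("A", "A", "B", "B"),
             Year = c(1, 2, 1, 2),
             Region = c("N", "N", "S", "S"),
             X = c(2, 3, 3, 5),
             Y = c(1, 4, 4, 3),
             Z = c(3, 4, 2, 2))

# Pivot only the keys plus the value columns, then join the
# descriptive columns back on afterwards
Df_long <- Df %>%
  select(Institution, Year, X, Y, Z) %>%
  pivot_longer(c(X, Y, Z), names_to = "G", values_to = "N") %>%
  left_join(select(Df, Institution, Year, Region),
            by = c("Institution", "Year"))
```

This keeps the memory-hungry pivot step working on the narrowest possible table; the join only repeats the descriptive columns once the long shape already exists.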