I have a data frame of ca. 50,000 RNA transcripts in rows, with 10,000 different samples in columns. The size of the data frame is 4.9GB.
I then have to transpose the data in order to subset it properly later:
df <- data.frame(t(df))
After the transpose, the object size has ballooned to 70GB. Why is this happening? Should transposing data really change the file size that much?
str() of the first 20 columns:
str(df[1:20])
Classes 'tbl_df', 'tbl' and 'data.frame': 56202 obs. of 20 variables:
$ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "ENSG00000223972.4" "ENSG00000227232.4" "ENSG00000243485.2" "ENSG00000237613.2" ...
$ Description : chr "DDX11L1" "WASH7P" "MIR1302-11" "FAM138A" ...
$ GTEX-1117F-0226-SM-5GZZ7: num 0.1082 21.4 0.1602 0.0505 0 ...
$ GTEX-111CU-1826-SM-5GZYN: num 0.1158 11.03 0.0643 0 0 ...
$ GTEX-111FC-0226-SM-5N9B8: num 0.021 16.75 0.0467 0.0295 0 ...
$ GTEX-111VG-2326-SM-5N9BK: num 0.0233 8.172 0 0.0326 0 ...
$ GTEX-111YS-2426-SM-5GZZQ: num 0 7.658 0.0586 0 0 ...
$ GTEX-1122O-2026-SM-5NQ91: num 0.0464 9.372 0 0 0 ...
$ GTEX-1128S-2126-SM-5H12U: num 0.0308 10.08 0.1367 0.0861 0.1108 ...
$ GTEX-113IC-0226-SM-5HL5C: num 0.0936 13.56 0.2079 0.131 0.0562 ...
$ GTEX-117YX-2226-SM-5EGJJ: num 0.121 9.889 0.0537 0.0677 0 ...
$ GTEX-11DXW-0326-SM-5H11W: num 0.0286 9.121 0.0635 0 0 ...
$ GTEX-11DXX-2326-SM-5Q5A2: num 0 6.698 0.0508 0.032 0 ...
$ GTEX-11DZ1-0226-SM-5A5KF: num 0.0237 9.835 0 0.0664 0 ...
$ GTEX-11EI6-0226-SM-5EQ64: num 0.0802 13.1 0 0 0 ...
$ GTEX-11EM3-2326-SM-5H12B: num 0.0223 8.904 0.0496 0.0625 0.0402 ...
$ GTEX-11EMC-2826-SM-5PNY6: num 0.0189 16.59 0 0.0265 0.034 ...
$ GTEX-11EQ8-0226-SM-5EQ5G: num 0.0931 15.1 0.0689 0.0869 0 ...
$ GTEX-11EQ9-2526-SM-5HL66: num 0.0777 9.838 0 0 0 ...
First, you write that:
I then have to transpose this dataset in order to subset it properly later,
To be honest, I doubt you have to, so this may be an XY problem. That said, I think it could be of general interest to dissect the issue.
The increase in object size most likely occurs because the class of the object changed when it was transposed, and objects of different classes have different sizes.
I will try to illustrate this with some examples. We begin with the change of class.
Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:
# set number of rows and columns
nr <- 5
nc <- 5
set.seed(1)
d <- data.frame(x = sample(letters, nr, replace = TRUE),
y = sample(letters, nr, replace = TRUE),
matrix(runif(nr * nc), nrow = nr),
stringsAsFactors = FALSE)
Transpose it:
d_t <- t(d)
Check the structure of the original data and its transposed sibling:
str(d)
# 'data.frame': 5 obs. of 7 variables:
# $ x : chr "g" "j" "o" "x" ...
# $ y : chr "x" "y" "r" "q" ...
# $ X1: num 0.206 0.177 0.687 0.384 0.77
# $ X2: num 0.498 0.718 0.992 0.38 0.777
# $ X3: num 0.935 0.212 0.652 0.126 0.267
# $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
# $ X5: num 0.482 0.6 0.494 0.186 0.827
str(d_t)
# chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "x" "y" "X1" "X2" ...
# ..$ : NULL
The data frame has become a character matrix. How did this happen? Well, check the help text for the transpose method for data frames, ?t.data.frame:
A data frame is first coerced to a matrix: see as.matrix.
OK, see ?as.matrix:
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]
Whereas a data frame is a list where each column can be of a different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of the transpose. Then you coerce the matrix back to a data frame, where all columns are character (or factor, depending on your stringsAsFactors setting); check str(data.frame(d_t)).
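The coercion described above can be checked directly. This is a minimal sketch that rebuilds the toy data frame `d` so it runs on its own; the names `m_mixed`, `m_num` and `d_back` are only illustrative.

```r
set.seed(1)
nr <- 5
nc <- 5
d <- data.frame(x = sample(letters, nr, replace = TRUE),
                y = sample(letters, nr, replace = TRUE),
                matrix(runif(nr * nc), nrow = nr),
                stringsAsFactors = FALSE)

# as.matrix() on mixed columns yields a character matrix
m_mixed <- as.matrix(d)
typeof(m_mixed)
# "character"

# on the numeric columns alone, the matrix stays numeric
m_num <- as.matrix(d[, -(1:2)])
typeof(m_num)
# "double"

# after the round trip through t(), every column is character
d_back <- data.frame(t(d), stringsAsFactors = FALSE)
unique(sapply(d_back, class))
# "character"
```

So the numeric values only survive as numbers if the character columns are left out before the data frame becomes a matrix.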
In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:
# original data frame
object.size(d)
# 2360 bytes
# transposed df - a character matrix
object.size(d_t)
# 3280 bytes
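The transposed object is clearly larger. If we increase the number of rows and the number of numeric columns to mimic your data better, the relative difference is even larger:
nr <- 56202
nc <- 20
# recreate d and d_t with the new dimensions, using the same code as above
d <- data.frame(x = sample(letters, nr, replace = TRUE),
                y = sample(letters, nr, replace = TRUE),
                matrix(runif(nr * nc), nrow = nr),
                stringsAsFactors = FALSE)
d_t <- t(d)
object.size(d)
# 9897712 bytes
object.size(d_t)
# 78299656 bytes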
Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:
onedigit_int <- sample(1:9, 1e4, replace = TRUE)
onedigit_num <- as.numeric(onedigit_int)
onedigit_char <- as.character(onedigit_int)
object.size(onedigit_int)
# 40048 bytes
object.size(onedigit_num)
# 80048 bytes
object.size(onedigit_char)
# 80552 bytes
For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors about 8 bytes per element. The single-character vector requires hardly any more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors of multi-digit values (which you seem to have) and their corresponding vectors of multi-character strings:
multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
multidigit_num <- as.numeric(multidigit_int)
multidigit_char <- as.character(multidigit_int)
object.size(multidigit_int)
# 40048 bytes
object.size(multidigit_num)
# 80048 bytes
object.size(multidigit_char)
# 637360 bytes
The integer vector still occupies 4 bytes for each element, and the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector grows with the length of the strings; here it is roughly 64 bytes per element.
Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.
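The same comparison can be made with decimal values, which are closer to the expression values in your data. This is a sketch on simulated numbers, not your actual data; `x_num` and `x_char` are illustrative names.

```r
set.seed(1)
x_num  <- runif(1e4)
x_char <- as.character(x_num)

# average bytes per element for each storage type
as.numeric(object.size(x_num))  / length(x_num)
as.numeric(object.size(x_char)) / length(x_char)

# the strings are long because as.character() keeps many significant digits
summary(nchar(x_char))
```

The exact sizes depend on the R version and platform, but the character version should come out several times larger per element than the numeric one.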
Transposing a data frame with columns of different classes is very rarely sensible. And if all columns are of the same class, then we may just as well use a matrix from the start.
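One way to follow this advice, sketched on the toy data: keep only the numeric columns, store them as a matrix, and carry the identifier along as dimnames. Using column `x` as the identifier is an assumption made for illustration; in your data it would presumably be `Name`.

```r
set.seed(1)
nr <- 5
nc <- 5
d <- data.frame(x = sample(letters, nr, replace = TRUE),
                y = sample(letters, nr, replace = TRUE),
                matrix(runif(nr * nc), nrow = nr),
                stringsAsFactors = FALSE)

# keep only the numeric columns and store them as a numeric matrix
m <- as.matrix(d[, sapply(d, is.numeric)])

# carry the identifier along as row names (assumed here: column x)
rownames(m) <- d$x

# transposing a numeric matrix keeps it numeric
m_t <- t(m)
typeof(m_t)
# "double"
object.size(m_t)
```

Because `m_t` is still numeric, each value keeps costing 8 bytes and the transpose does not inflate the object.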
Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham