 

What is happening when R combines two vectors using c()?

Tags:

r

When you concatenate two vectors in R using c(), it "combines the arguments and results in a vector". Does it combine them by creating a new vector to take on the elements of the two vectors or is there a way to literally combine the data spaces allocated for both vectors?

When I searched I couldn't find an explanation. The visual representations of c() all literally just attach the second vector to the end of the first vector, but I think that's just so we can easily understand what this function does and not what actually happens.

vina asked Dec 06 '25 06:12



1 Answer

When you call c(), R allocates a new vector and copies the elements of the existing vectors into it. The allocation happens in this line of the underlying C code:

PROTECT(ans = allocVector(mode, data.ans_length));
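One way to see the new allocation from the R side is to compare object addresses. The following is a minimal sketch, assuming an R build with memory profiling enabled (tracemem() is unavailable otherwise, hence the capabilities() guard); the variable names are illustrative only.

```r
a <- c(1, 2, 3)
b <- c(4, 5)
ab <- c(a, b)

# tracemem() only works when R was built with memory profiling,
# so the address comparison is guarded.
if (capabilities("profmem")) {
  addr_a  <- tracemem(a)
  addr_b  <- tracemem(b)
  addr_ab <- tracemem(ab)

  # Three distinct objects: ab does not reuse the storage of a or b.
  print(c(a = addr_a, b = addr_b, ab = addr_ab))
  stopifnot(addr_ab != addr_a, addr_ab != addr_b)

  # Stop tracing again so later copies are not reported.
  untracemem(a)
  untracemem(b)
  untracemem(ab)
}
```

The printed addresses differ from session to session; what matters is that the address of ab matches neither input.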

This might seem wasteful, since we already have the values written to memory, so why not just wrap up a couple of pointers to this memory and call that a vector?

There are several reasons for this.

Firstly, many of the arithmetic and statistical operations that R carries out on vectors are implemented by iterating through elements in contiguous memory. If a vector's elements were scattered across separate allocations, each access would need extra address checks and jumps between memory locations, which would make things a lot slower. Outside of R, concatenating vectors in C or C++ is also usually done by allocating a new vector, for much the same reason.
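To make the cost of a non-contiguous layout concrete, here is a rough sketch in R. The list called chunks and the helper chunk_get() are hypothetical stand-ins for a "vector made of pointers to pieces"; R itself does not represent vectors this way.

```r
a <- c(1, 2, 3)
b <- c(4, 5)

# Hypothetical non-contiguous representation: keep the pieces separate.
chunks <- list(a, b)

# Hypothetical helper: find element i by walking the pieces in order.
chunk_get <- function(chunks, i) {
  for (piece in chunks) {
    n <- length(piece)
    if (i <= n) return(piece[[i]])
    i <- i - n  # skip this piece and look in the next one
  }
  stop("index out of range")
}

# Contiguous representation: one allocation, direct indexing.
ab <- c(a, b)

chunk_get(chunks, 4)  # needs a search over the pieces
ab[4]                 # a single offset into one block
```

Both calls return the same value, but the first has to work out which piece holds the element before it can read it, and every vectorised operation would pay a similar cost.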

A second reason is to avoid fragmentation and memory leaks. If we built a vector by concatenating subsets of other vectors without allocating dedicated memory, we would end up with a collection of pointers to different locations in the free store. Taking subsets of this vector would then produce pointers to pointers to fragments of vectors, along with unused fragments that could not be re-used or reclaimed by the garbage collector.

A third reason is that R users expect copy-on-modify behaviour. For example, if we have:

a <- c(1, 2, 3)

b <- c(a, a)

b
#> [1] 1 2 3 1 2 3

Then we expect to be able to change a single element:

b[6] <- 6

b
#> [1] 1 2 3 1 2 6

If b did not have its own data and merely referred to the data of a twice, this assignment would write through to a[3], so the third element of b would change along with the sixth.
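A quick check, repeating the example above, suggests that the assignment touches only the sixth element of b and leaves a alone:

```r
a <- c(1, 2, 3)
b <- c(a, a)
b[6] <- 6

a     # unchanged by the assignment to b
#> [1] 1 2 3

b[3]  # the third element of b is unchanged as well
#> [1] 3
```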

As Nicola points out in the comments, another reason is that c will carry out type checking and implicit conversion between types to ensure that the underlying storage mode of the new vector is consistent. This allows some straightforward and well-defined flexibility between integers, doubles, logical vectors, factors and character strings which would be impossible if vectors created by c were composed of pieces of existing vectors.
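A small illustration of that coercion, using typeof() to inspect the storage mode of each result (the variable names are just for the example):

```r
x <- c(TRUE, 2L)   # logical + integer
y <- c(1L, 2.5)    # integer + double
z <- c(1, "a")     # double + character

typeof(x)
#> [1] "integer"
typeof(y)
#> [1] "double"
typeof(z)
#> [1] "character"
```

In each case the result uses a single storage mode able to hold every input, which is only possible because c() writes the converted values into a freshly allocated vector.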


Conceptually, the memory allocation in R works like this: each R object is handled in C through a SEXP. A SEXP is basically a pointer to the data itself, which is stored in memory in a structure called a SEXPREC.

Therefore, if we run the code:

A <- 1:4
B <- 5:14

the vectors A and B might be stored in memory like this:

[Figure: A and B each point to their own SEXPREC, holding the values 1 to 4 and 5 to 14 respectively]

If we then do

C <- c(A, B)

Then in memory we get:

[Figure: C points to a third SEXPREC holding the values 1 to 14; A and B still point to their original SEXPREC objects]

The data in the SEXPREC pointed to by C has been copied from the data in the two other SEXPREC objects pointed to by A and B.
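The independence of C from A and B can be checked from R. This is a minimal sketch that only looks at observable behaviour, not at the SEXPREC structures themselves:

```r
A <- 1:4
B <- 5:14
C <- c(A, B)

C[1] <- 100L   # modify the combined vector
A[1]           # A still starts with 1

rm(A, B)       # drop the original vectors
C              # C still holds all 14 values
```

Because C owns its own copy of the data, it is unaffected when A and B are removed and their memory becomes eligible for garbage collection.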

Allan Cameron answered Dec 08 '25 21:12
