I have a simple vector of integers in R. I would like to randomly select n positions in the vector and "merge" them (i.e. sum) in the vector. This process could happen multiple times, i.e. in a vector of 100, 5 merging/summing events could occur, with 2, 3, 2, 4, and 2 vector positions being merged in each event, respectively. For instance:
#An example original vector of length 10:
ex.have<-c(1,1,30,16,2,2,2,1,1,9)
#For simplicity assume some process randomly combines the 
#first two [1,1] and last three [1,1,9] positions in the vector. 
ex.want<-c(2,30,16,2,2,2,11)
#Here, there were two merging events of 2 and 3 vector positions, respectively
#EDIT: the merged positions do not need to be consecutive. 
#They could be randomly selected from any position. 
But in addition I also need to record how many vector positions were "merged," (including the value 1 if the position in the vector was not merged) - terming them indices. Since the first two were merged and the last three were merged in the example above, the indices data would look like:
ex.indices<-c(2,1,1,1,1,1,3)
Finally, I need to put it all in a matrix, so the final data in the example above would be a 2-column matrix with the integers in one column and the indices in another:
ex.final<-matrix(c(2,30,16,2,2,2,11,2,1,1,1,1,1,3),ncol=2,nrow=7)
At the moment I am seeking assistance even on the simplest step: combining positions in the vector. I have tried multiple variations on the sample and split functions, but am hitting a dead end. For instance, sum(sample(ex.have,2)) will sum two randomly selected positions (or sum(sample(ex.have,rpois(1,2))) will add some randomness in the n values), but I am unsure how to leverage this to achieve the desired dataset. An exhaustive search has led to multiple articles on combining vectors, but not positions in vectors, so I apologize if this is a duplicate. Any advice on how to approach any of this would be much appreciated.
Here is a function I designed to perform the task you described.
The vec_merge function takes the following arguments:
x: an integer vector.
event_perc: the proportion of events, a number greater than 0 and at most 1 (although 1 is probably too large). The number of events is calculated as the length of x multiplied by event_perc, rounded to the nearest integer.
sample_n: the merge sample sizes. This is an integer vector in which every value is at least 2.
vec_merge <- function(x, event_perc = 0.2, sample_n = c(2, 3)){
  # Check if event_perc makes sense
  if (event_perc > 1 || event_perc <= 0){
    stop("event_perc should be greater than 0 and at most 1.")
  }
  # Check if sample_n makes sense
  if (any(sample_n < 2)){
    stop("sample_n should be at least 2")
  }
  # Determine the event numbers
  n <- round(length(x) * event_perc)
  # Determine the sample number of each event
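  # Index into sample_n so that a length-one sample_n is not treated as 1:sample_n
  sample_vec <- sample_n[sample.int(length(sample_n), size = n, replace = TRUE)]
  # seq_len() avoids the c(1, 0) that 1:n would give when n is 0
  names(sample_vec) <- paste0("S", seq_len(n))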
  # Check if the sum of sample_vec is larger than the length of x
  # If yes, stop the function and print a message 
  if (length(x) < sum(sample_vec)){
    stop("Too many samples. Decrease event_perc or sample_n")
  }
  # Determine the number that will not be merged
  n2 <- length(x) - sum(sample_vec) 
  # Create a vector of n2 ones, one for each position that is not merged
  non_merge_vec <- rep(1, n2)
  # seq_len() keeps this valid when n2 is 0
  names(non_merge_vec) <- paste0("N", seq_len(n2))
  # Combine sample_vec and non_merge_vec, and then randomly shuffle the vector
  combine_vec <- c(sample_vec, non_merge_vec)
  # Shuffle by index so a length-one combine_vec keeps its value and name
  combine_vec2 <- combine_vec[sample.int(length(combine_vec))]
  # Expand the vector
  expand_list <- list(lengths = combine_vec2, values = names(combine_vec2))
  expand_vec <- inverse.rle(expand_list)
  # Create a data frame with x and expand_vec
  dat <- data.frame(number = x, 
                    group = factor(expand_vec, levels = unique(expand_vec)))
  dat$index <- 1
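  # Sum the values and the counts within each group; selecting the columns
  # by name keeps "number" and "index" as the column names of the result
  dat2 <- aggregate(dat[c("number", "index")], 
                    by = list(group = dat$group),
                    FUN = sum)
  # Convert dat2 to a matrix, remove the group column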
  dat2$group <- NULL
  mat <- as.matrix(dat2)
  return(mat)
}
Here is a test for the function. I applied the function to the sequence from 1 to 10. As you can see, in this example, 4 and 5 are merged, and 8 and 9 are also merged.
set.seed(123)
vec_merge(1:10)
#      number index
# [1,]      1     1
# [2,]      2     1
# [3,]      3     1
# [4,]      9     2
# [5,]      6     1
# [6,]      7     1
# [7,]     17     2
# [8,]     10     1
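The run-length expansion that vec_merge relies on is easier to follow in isolation. The sketch below is only an illustration with hand-picked group sizes and no random shuffling (the names group_sizes, labels, x and merged are made up for this example): it expands the sizes into one label per position with inverse.rle, then sums the values and counts the positions within each label.

```r
# Hand-picked group sizes: N* are single positions, S* are merged runs
group_sizes <- c(N1 = 1, S1 = 2, N2 = 1, S2 = 3)

# Expand to one label per position of x
labels <- inverse.rle(list(lengths = group_sizes, values = names(group_sizes)))

# Illustrative input vector with sum(group_sizes) = 7 elements
x <- c(5, 1, 1, 30, 1, 1, 9)

# Keep the groups in order of first appearance
f <- factor(labels, levels = unique(labels))

# Sum the values and count the positions within each group
merged <- cbind(number = as.vector(tapply(x, f, sum)),
                index  = as.vector(table(f)))
merged
```

Here positions 2 to 3 and 5 to 7 share a label, so the sums are 5, 2, 30 and 11 with counts 1, 2, 1 and 3.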
I suppose you could write a function like the following:
fun <- function(vec = have, events = merge_events, include_orig = TRUE) {
  if (sum(events) > length(vec)) stop("Too many events to merge")
  # Create "groups" for the events
  merge_events_seq <- rep(seq_along(events), events) 
  # Create "groups" for the rest of the data
  remainder <- sequence((length(vec) - sum(events))) + length(events)
  # Combine both groups and shuffle them so that the 
  # positions being combined are not necessarily consecutive
  inds <- sample(c(merge_events_seq, remainder))
  # Aggregate using `data.table`
  temp <- data.table(values = vec, groups = inds)[
    , list(count = length(values), 
           total = sum(values),
           pos = toString(.I),
           original = toString(values)), groups][, groups := NULL]
  # Drop the other columns if required. Return the output.
  if (isTRUE(include_orig)) temp[] else temp[, c("original", "pos") := NULL][]
}
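The function returns four columns:
count: the number of values that were included in each sum (your ex.indices).
total: the total after summing the relevant values (your ex.want).
pos: the positions of the original values from the input vector.
original: the original values from the input vector.
The last two columns can be dropped from the result by setting include_orig = FALSE. The function will also produce an error if the number of elements you're trying to merge exceeds the length of the input (ex.have) vector.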
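To see how the grouping works, here is a small sketch of just the id-building step, pulled out of the function. The input length n and the variable names are assumptions made for illustration only.

```r
events <- c(2, 3)   # two merge events, of 2 and 3 positions
n <- 10             # assumed length of the input vector

# One id per merge event, repeated once per merged position: 1 1 2 2 2
merge_ids <- rep(seq_along(events), events)

# A unique id for every position that is left alone: 3 4 5 6 7
single_ids <- seq_len(n - sum(events)) + length(events)

# Shuffle the ids so that merged positions need not be adjacent
all_ids <- c(merge_ids, single_ids)
ids <- all_ids[sample.int(length(all_ids))]

# The group sizes do not depend on the shuffle
tabulate(ids)
```

Whatever the shuffle, the first two ids occur 2 and 3 times and every other id occurs once, which is what ends up in the count column.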
Here's some sample data:
library(data.table)
set.seed(1) ## So you can recreate these examples with the same results
have <- sample(20, 10, TRUE)
have
##  [1]  4  7  1  2 11 14 18 19  1 10
merge_events <- c(2, 3)
fun(have, merge_events)
##    count total      pos   original
## 1:     1     4        1          4
## 2:     1     7        2          7
## 3:     2     2     3, 9       1, 1
## 4:     1     2        4          2
## 5:     3    40 5, 8, 10 11, 19, 10
## 6:     1    14        6         14
## 7:     1    18        7         18
fun(events = c(3, 4))
##    count total        pos     original
## 1:     4    39 1, 4, 6, 8 4, 2, 14, 19
## 2:     3    36    2, 5, 7    7, 11, 18
## 3:     1     1          3            1
## 4:     1     1          9            1
## 5:     1    10         10           10
fun(events = c(6, 4, 3))
## Error: Too many events to merge
input <- sample(30, 20, TRUE)
input
##  [1]  6 10 10  6 15 20 28 20 26 12 25 23  6 25  8 12 25 23 24  6
fun(input, events = c(4, 7, 2, 3))
##    count total                    pos                original
## 1:     7    92 1, 3, 4, 5, 11, 19, 20 6, 10, 6, 15, 25, 24, 6
## 2:     1    10                      2                      10
## 3:     3    71               6, 9, 14              20, 26, 25
## 4:     4    69          7, 12, 13, 16           28, 23, 6, 12
## 5:     2    45                  8, 17                  20, 25
## 6:     1    12                     10                      12
## 7:     1     8                     15                       8
## 8:     1    23                     18                      23
# Verification
input[c(1, 3, 4, 5, 11, 19, 20)]
## [1]  6 10  6 15 25 24  6
sum(.Last.value)
## [1] 92
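If you need the two-column matrix from the question rather than a data.table, a base R variant of the same idea could look like the sketch below. merge_positions is a hypothetical name, not part of any package, and the rows come out ordered by group id (merge events first) rather than by position in the input.

```r
# Sketch only: merge randomly chosen positions and return a 2-column matrix
merge_positions <- function(vec, events) {
  if (sum(events) > length(vec)) stop("Too many events to merge")
  # Group ids: one per merge event, then one per untouched position
  ids <- c(rep(seq_along(events), events),
           seq_len(length(vec) - sum(events)) + length(events))
  # Shuffle so that merged positions are not necessarily consecutive
  ids <- ids[sample.int(length(ids))]
  # Sum the values and count the positions in each group
  cbind(value = as.vector(tapply(vec, ids, sum)),
        index = tabulate(ids))
}

ex.have <- c(1, 1, 30, 16, 2, 2, 2, 1, 1, 9)
res <- merge_positions(ex.have, events = c(2, 3))
res
```

The values in the first column depend on the random shuffle, but the second column is always 2, 3, 1, 1, 1, 1, 1 for these events, and the first column always sums to sum(ex.have).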