Is there an R function to pair elements from two columns of a data.frame to make a new data.frame?

Tags: r, tidyr

I have several thousand rows like this stored in an object in R. I want to make a new data frame, df2, by pairing elements from two columns ($exonStarts and $exonEnds) of the df below:

df <- structure(list(bin = c(0L, 0L, 0L, 0L, 0L, 0L), name = c("XM_011541469.2", 
"XM_017001276.2", "XM_011541467.2", "NM_001276352.2", "NM_001276351.2", 
"XM_011541465.3"), chr = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1"), strand = c("-", "-", "-", "-", "-", "-"), txStart = c(67092164L, 
67092164L, 67092164L, 67092164L, 67092164L, 67092164L), txEnd = c(67109072L, 
67131227L, 67131227L, 67134970L, 67134970L, 67134970L), cdsStart = c(67093004L, 
67093004L, 67093004L, 67093579L, 67093004L, 67093004L), cdsEnd = c(67103382L, 
67127240L, 67127240L, 67127240L, 67127240L, 67127240L), exonCount = c(5L, 
9L, 9L, 9L, 8L, 9L), exonStarts = c("67092164,67095234,67096251,67103237,67109028,", 
"67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67131141,", 
"67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67131141,", 
"67092164,67096251,67103237,67111576,67115351,67125751,67127165,67131141,67134929,", 
"67092164,67095234,67096251,67115351,67125751,67127165,67131141,67134929,", 
"67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67134929,"
), exonEnds = c("67093604,67095421,67096321,67103382,67109072,", 
"67093604,67095421,67096321,67103382,67111644,67115464,67125909,67127257,67131227,", 
"67093604,67095421,67096321,67103343,67111644,67115464,67125909,67127257,67131227,", 
"67093604,67096321,67103382,67111644,67115464,67125909,67127257,67131227,67134970,", 
"67093604,67095421,67096321,67115464,67125909,67127257,67131227,67134970,", 
"67093604,67095421,67096321,67103382,67111644,67115464,67125909,67127257,67134970,"
), score = c(0L, 0L, 0L, 0L, 0L, 0L), gene = c("C1orf141", "C1orf141", 
"C1orf141", "C1orf141", "C1orf141", "C1orf141"), cdsStartStat = c("cmpl", 
"cmpl", "cmpl", "cmpl", "cmpl", "cmpl"), cdsEndStat = c("cmpl", 
"cmpl", "cmpl", "cmpl", "cmpl", "cmpl"), exonFrames = c("0,2,1,0,-1,", 
"0,2,1,0,1,2,0,0,-1,", "0,2,1,0,1,2,0,0,-1,", "2,1,0,1,2,0,0,-1,-1,", 
"0,2,1,2,0,0,-1,-1,", "0,2,1,0,1,2,0,0,-1,")), row.names = c(NA, 
6L), class = "data.frame")
 

to produce output like that shown below. The first row of the input df should produce 5 rows of df2 in this format, by pairing each element of $exonStarts with the element of $exonEnds at the same position (the element counts will be identical, and the count is given in the column $exonCount).

$chr    $exonStart  $exonEnd
chr1    67092164    67093604
chr1    67095234    67095421
chr1    67096251    67096321
chr1    67103237    67103382
chr1    67109028    67109072

This needs to iterate over all rows of the input df, with all results combined into df2. In total, df2 would have sum(df$exonCount) rows.

I know that there would be some sort of strsplit function in here to split $exonStarts and $exonEnds and then combine them. Maybe an apply function to do this over all rows? I tried something like this first

library(purrr)  # provides map2()
map2(unlist(strsplit(df$exonStarts[1], ",")), unlist(strsplit(df$exonEnds[1], ",")), c)

which produced a list of paired elements:

[[1]]
[1] "67092164" "67093604"

[[2]]
[1] "67095234" "67095421"

[[3]]
[1] "67096251" "67096321"

[[4]]
[1] "67103237" "67103382"

[[5]]
[1] "67109028" "67109072"

But I do not know where to go from here. Any help would be appreciated.

asked Sep 01 '25 20:09 by user1675107


2 Answers

You could use separate_longer_delim() from tidyr:

library(tidyr)
library(dplyr)

select(df, chr, exonStarts, exonEnds) %>%
  separate_longer_delim(c(exonStarts, exonEnds), delim=",")

Which gives (showing first 10 rows only):

    chr exonStarts exonEnds
1  chr1   67092164 67093604
2  chr1   67095234 67095421
3  chr1   67096251 67096321
4  chr1   67103237 67103382
5  chr1   67109028 67109072
6  chr1                    
7  chr1   67092164 67093604
8  chr1   67095234 67095421
9  chr1   67096251 67096321
10 chr1   67103237 67103382

The blank rows are due to the trailing commas in the original columns, which can easily be removed using filter. If you do that, you will see that there are 49 rows remaining, the same as sum(df$exonCount).
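The filtering step mentioned above might look like the following. This is only a sketch: `toy` is a made-up two-row stand-in for the question's df (not real data), and the conversion to integer is an extra assumption about what you want the coordinates to be.

```r
library(tidyr)
library(dplyr)

# Hypothetical miniature of the question's df: two transcripts,
# trailing commas kept as in the original columns.
toy <- data.frame(
  chr        = c("chr1", "chr1"),
  exonCount  = c(2L, 3L),
  exonStarts = c("100,300,", "100,300,500,"),
  exonEnds   = c("200,400,", "200,400,600,")
)

df2 <- toy %>%
  select(chr, exonStarts, exonEnds) %>%
  separate_longer_delim(c(exonStarts, exonEnds), delim = ",") %>%
  filter(exonStarts != "") %>%                          # drop the blank row left by each trailing comma
  mutate(across(c(exonStarts, exonEnds), as.integer))   # coordinates arrive as character strings

# One output row per exon, as the question requires
stopifnot(nrow(df2) == sum(toy$exonCount))
```

Filtering on exonStarts alone is enough here because the blank entries come in pairs, one per column, from the same trailing comma.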


As @TarJae mentions, the number of items in exonStarts and exonEnds must be equal within each row, or an error will result. Presumably you are confident that this situation isn't possible. For example, if one end coordinate is dropped from the first row:

df$exonEnds[1] <- "67093604,67095421,67096321,67103382,"

select(df, chr, exonStarts, exonEnds) %>%
  separate_longer_delim(c(exonStarts, exonEnds), delim=",")

#Error in `separate_longer_delim()`:
#! In row 1, can't recycle input of size 6 to size 5.
#Run `rlang::last_trace()` to see where the error occurred.

You'll also get an error using your strsplit approach.
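If you are not sure the counts always match, one option is to check before splitting. The sketch below uses a made-up data frame `toy` (hypothetical, with one row deliberately broken) and relies on the fact that base strsplit() does not return a trailing empty string, so the trailing comma does not inflate the counts.

```r
# Hypothetical example data: tx2 is deliberately missing one exon end.
toy <- data.frame(
  name       = c("tx1", "tx2"),
  exonCount  = c(2L, 3L),
  exonStarts = c("100,300,", "100,300,500,"),
  exonEnds   = c("200,400,", "200,400,")
)

# Number of comma-separated items per row in each column
n_starts <- lengths(strsplit(toy$exonStarts, ",", fixed = TRUE))
n_ends   <- lengths(strsplit(toy$exonEnds,   ",", fixed = TRUE))

# Rows where starts and ends disagree, or disagree with exonCount
bad <- which(n_starts != n_ends | n_starts != toy$exonCount)
toy$name[bad]
```

Rows flagged in `bad` can then be inspected or dropped before calling separate_longer_delim().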

answered Sep 04 '25 00:09 by Edward


In base R, we can do

f = \(chr, split, ...) {
  stopifnot(is.character(chr))
  l = strsplit(chr, split, ...)
  # caution, this pads NA's to the end, if shorter than max.
  l = lapply(l, `length<-`, max(lengths(l))) 
  unlist(l, recursive = FALSE, use.names = FALSE)
}

(For other applications we need to modify the last line of f().)

Application:

> lapply(df[c("exonStarts", "exonEnds")], f, split = ",") |> 
+   list2DF() |>
+   head()
  exonStarts exonEnds
1   67092164 67093604
2   67095234 67095421
3   67096251 67096321
4   67103237 67103382
5   67109028 67109072
6       <NA>     <NA>

Caution. This approach has the disadvantage of padding NAs onto the end of any ","-split row of exonStarts or exonEnds that is shorter than the longest row. Put differently, it rests on the strong assumption that missing data, if present, sits at the end of each row. As long as your data do not carry missing values elsewhere, this approach might be appropriate; the padded NA rows can be dropped afterwards.
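A variant that avoids the padding altogether, and also carries the chr column along, is sketched below. It is not the function f() above but a different base R route using rep() with lengths(); `toy` is a made-up stand-in for the question's df, and the output column names exonStart and exonEnd follow the question's desired format.

```r
# Hypothetical miniature of the question's df
toy <- data.frame(
  chr        = c("chr1", "chr2"),
  exonStarts = c("100,300,", "100,300,500,"),
  exonEnds   = c("200,400,", "200,400,600,")
)

# strsplit() drops the empty string after the trailing comma
starts <- strsplit(toy$exonStarts, ",", fixed = TRUE)
ends   <- strsplit(toy$exonEnds,   ",", fixed = TRUE)

# Fail early if any row has unequal numbers of starts and ends
stopifnot(lengths(starts) == lengths(ends))

df2 <- data.frame(
  chr       = rep(toy$chr, lengths(starts)),  # repeat chr once per exon
  exonStart = as.integer(unlist(starts)),
  exonEnd   = as.integer(unlist(ends))
)
df2
```

Because each row contributes exactly as many output rows as it has exons, no NA padding is introduced and no filtering is needed afterwards.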


Compact:

f = \(chr, split, ...) unlist(lapply(l <- strsplit(chr, split, ...), `length<-`, max(lengths(l))))
list2DF(lapply(df[c("exonStarts", "exonEnds")], f, split = ","))

answered Sep 03 '25 23:09 by Friede