
Split string of characters contained in a row of a data frame by a fixed number of characters and store the resultant fragments in subsequent rows

I have the following data frame:

df <- data.frame(V1 = c(">A1_[Er]", 
                        "aaaabbbcccc", 
                        ">B2_[Br]", 
                        "ddddeeeeeff", 
                        ">C3_[Gh]", 
                        "ggggggghhhhhiiiiijjjjjj"))

I want to split each string into pieces of a fixed number of characters (two, for the purposes of this question) and place each piece in its own row. Rows whose strings start with the ">" sign should be kept as they are, without splitting. The resulting data frame should look like this:

df1 <- data.frame(V1 = c(">A1_[Er]", "aa", "aa", "bb", "bc", "cc", "c", 
                         ">B2_[Br]", "dd", "dd", "ee", "ee", "ef", "f",
                         ">C3_[Gh]", "gg", "gg", "gg", "gh", "hh", "hh", "ii", "ii", "ij", "jj", "jj", "j"))

I have tried using the separate_longer_position() function on a subsetted df like this:

separate_longer_position(subset(df, !df$V1 %like% ">"), V1, 2)

This does split the sequence strings as intended, but because the rows starting with ">" are removed before splitting, they are also missing from the resulting data frame.

On a side note, this is indeed the FASTA format, but for educational purposes I don't want to use dedicated packages like Biostrings to solve this.

Please advise.

Traitor Legions asked Sep 03 '25 16:09


1 Answer

You can try regmatches() together with gregexpr(), returning the rows that start with ">" unchanged:

df1 <-
  data.frame(V1 = with(
    df,
    unlist(
      lapply(
        V1,
        function(x) {
          if (startsWith(x, ">")) {
            # header row: keep it as is
            x
          } else {
            # sequence row: greedily match runs of up to two word characters
            regmatches(x, gregexpr("\\w{1,2}", x))
          }
        }
      )
    )
  ))

and obtain

> df1
         V1
1  >A1_[Er]
2        aa
3        aa
4        bb
5        bc
6        cc
7         c
8  >B2_[Br]
9        dd
10       dd
11       ee
12       ee
13       ef
14        f
15 >C3_[Gh]
16       gg
17       gg
18       gg
19       gh
20       hh
21       hh
22       ii
23       ii
24       ij
25       jj
26       jj
27        j
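
If the sequences may contain characters that `\w` does not match (for example a gap "-" or a stop "*"), or if the chunk width should be a parameter, a regex-free variant might be easier to adapt. The following is only a sketch in base R: `split_fixed()` and `split_rows()` are hypothetical helper names introduced here, not part of the code above, and it assumes the column is called V1 as in the question.

```r
# Hypothetical helper: cut one string into pieces of `width` characters;
# the last piece is shorter when nchar(x) is not a multiple of `width`.
split_fixed <- function(x, width = 2L) {
  n <- nchar(x)
  if (n == 0L) return(character(0))
  starts <- seq(1L, n, by = width)
  substring(x, starts, pmin(starts + width - 1L, n))
}

# Hypothetical helper: split every row of df$V1 except header rows,
# i.e. rows starting with ">", which are passed through unchanged.
split_rows <- function(df, width = 2L) {
  v <- as.character(df$V1)
  pieces <- lapply(v, function(x) {
    if (startsWith(x, ">")) x else split_fixed(x, width)
  })
  data.frame(V1 = unlist(pieces))
}

# Same input as in the question
df <- data.frame(V1 = c(">A1_[Er]", "aaaabbbcccc",
                        ">B2_[Br]", "ddddeeeeeff",
                        ">C3_[Gh]", "ggggggghhhhhiiiiijjjjjj"))

df2 <- split_rows(df, width = 2L)
```

Because substring() works on character positions rather than on a pattern, every character is kept, and calling split_rows(df, width = 3L) would give three-character pieces with the same header handling.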
ThomasIsCoding answered Sep 05 '25 07:09