
How to extract the unique letter from each run of consecutive letters in a string?

Tags: string, r, unique

For example, suppose I have the string x = "AAATTTGGAA".

I want to split x into runs of consecutive identical letters: "AAA", "TTT", "GG", "AA".

The unique letter of each run is then "A", "T", "G", "A", so the expected output is "ATGA".

How can I do this?

Park asked Oct 26 '25 15:10


2 Answers

Here is a regex approach that uses lookarounds to split the string into runs:

x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out

[1] "AAA" "TTT" "GG"  "AA"

The regex pattern used here says to split at any boundary where the preceding and following characters are different.

(?<=(.))  lookbehind and also capture preceding character in \1
(?!\\1)   then lookahead and assert that following character is different
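
To get from these chunks to the requested "ATGA", one possible follow-up (a sketch, not part of the original answer) is to keep the first character of each chunk and paste the results together. The names chunks, firsts and result are illustrative only.

```r
x <- "AAATTTGGAA"

# Split at every boundary where the next character differs from the previous one
chunks <- strsplit(x, "(?<=(.))(?!\\1)", perl = TRUE)[[1]]

# Each chunk is a run of a single letter, so its first character is its unique letter
firsts <- substr(chunks, 1, 1)

result <- paste(firsts, collapse = "")
result
#[1] "ATGA"
```

This relies on every chunk containing only one distinct letter, which the split pattern guarantees.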
Tim Biegeleisen answered Oct 29 '25 06:10


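You can split the string into individual characters, then use rle() to find the runs of consecutive identical characters. Its values component holds one entry per run, which is exactly the unique letter of each run.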

x <- "AAATTTGGAA"
vec <- unlist(strsplit(x, ''))

rle(vec)$values
#[1] "A" "T" "G" "A"

paste0(rle(vec)$values, collapse = '')
#[1] "ATGA"
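
If the same operation is needed for several strings at once, the rle() idea could be wrapped in a small helper. This is only a sketch: collapse_runs is a hypothetical name, not a base R function. It is shown next to a regex one-liner with gsub() that replaces each repeated run with a single copy of its letter, which is a different technique from rle().

```r
# Hypothetical helper: reduce every run of identical characters to one character
collapse_runs <- function(s) {
  vapply(strsplit(s, ""), function(chars) {
    # rle()$values has one entry per run of consecutive equal elements
    paste0(rle(chars)$values, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

inputs <- c("AAATTTGGAA", "AABBA", "X")
collapsed <- collapse_runs(inputs)
collapsed
#[1] "ATGA" "ABA"  "X"

# Regex alternative: a character followed by one or more copies of itself
# is replaced by a single copy
collapsed_regex <- gsub("(.)\\1+", "\\1", inputs, perl = TRUE)
```

Both versions should agree on inputs like these; the rle() version makes the run structure explicit, while the gsub() version is shorter.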
Ronak Shah answered Oct 29 '25 05:10