I am running the following code:
str_extract_all("AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAA", ".{5}X.{5}")
but I only get one string back. If I rerun the same code with 4 characters on each side, I get two strings, as expected. So I understand the problem is that the two matches would overlap at their ends (there are only 9 characters between the two "X"s). This behaviour does not seem to be documented in ?str_extract_all. Any suggestions on how I can get all the strings, even if their ends overlap?
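We can do that with a positive lookahead. A lookahead matches at a position without consuming any characters, so the next match attempt starts just one character later and overlapping matches are found. The capture group inside the lookahead holds the text we actually want.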
string <- "AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAA"
stringr::str_match_all(string, "(?=(.{5}X.{5}))")[[1]][, 2]
#[1] "AAAAAXAAAAA" "AAAAAXBAAAA"
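To make the difference visible, here is a minimal sketch that runs the plain pattern and the lookahead version side by side. The helper name extract_overlapping is hypothetical (it is not part of stringr), and it assumes the pattern passed in has no capture groups of its own.

```r
library(stringr)

# Hypothetical helper: wrap `pattern` in a capturing lookahead so that
# successive matches are allowed to overlap.
extract_overlapping <- function(string, pattern) {
  lookahead <- paste0("(?=(", pattern, "))")
  # Column 1 is the (empty) overall match; column 2 is the captured text.
  str_match_all(string, lookahead)[[1]][, 2]
}

string <- "AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAA"

# Plain pattern: the first match consumes characters 11 to 21, so the second
# candidate (which would start at character 21) is never found.
plain <- str_extract_all(string, ".{5}X.{5}")[[1]]
print(plain)
#[1] "AAAAAXAAAAA"

# Lookahead version: nothing is consumed, so both candidates are returned.
overlapping <- extract_overlapping(string, ".{5}X.{5}")
print(overlapping)
#[1] "AAAAAXAAAAA" "AAAAAXBAAAA"
```

If the pattern does contain its own groups, the wanted text is still in column 2, but the extra groups appear in later columns.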
We can get around this unfortunate feature as follows:
Let's give the ugly string a name, and find the positions of the X's that have at least 5 characters on each side.
library(stringr)
library(purrr)  # provides map_chr(), used further down
aax <- "AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAAX"
x.mtrx <- str_locate_all(aax, "(?x) (?<=.{5}) X (?=.{5})")[[1]]
Since we're only passing one string, we only want the [[1]] element of the result, which is a matrix. [The (?x) flag turns on free-spacing mode, which lets me put spaces in my regex; it quickly becomes illegible otherwise.]
# R > x.mtrx
# start end
# [1,] 16 16
# [2,] 26 26
Split the matrix into single rows (of start and end positions, which are the same for a single-character X), then use each row to extract the string from aax.
split(x.mtrx, seq(nrow(x.mtrx))) %>%
map_chr(~ str_sub(aax, start = .x[1] - 5, end = .x[2] + 5) )
#             1             2
# "AAAAAXAAAAA" "AAAAAXBAAAA"
Notice that the terminal X wasn't captured, because it didn't have 5 chars beyond it.
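As a possible simplification, str_sub() is vectorised over start and end, so the split/map step can be skipped. This is only a sketch of the same idea: the names flank, pattern, pos and hits are made up for illustration, and the regex is written without (?x) spacing.

```r
library(stringr)

aax <- "AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAAX"
flank <- 5L  # hypothetical name: characters required on each side of the X

# Locate every X that has `flank` characters both before and after it.
pattern <- sprintf("(?<=.{%d})X(?=.{%d})", flank, flank)
pos <- str_locate_all(aax, pattern)[[1]]

# str_sub() accepts vectors of start/end positions, one pair per hit.
hits <- str_sub(aax, start = pos[, "start"] - flank, end = pos[, "end"] + flank)
print(hits)
#[1] "AAAAAXAAAAA" "AAAAAXBAAAA"
```

The result is an unnamed character vector rather than the named one that map_chr() returns; the terminal X is still skipped for the same reason.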