I need to split a text into words. I tried:
strsplit("Hello World.", split = "\\b", perl = T)
and got:
[[1]]
[1] "H" "e" "l" "l" "o" " " "W" "o" "r" "l" "d" "."
Because I used perl = TRUE, I expected that it would split at every word, and not every letter.
A word boundary \b is a zero-width assertion that matches at the position between a word character and a non-word character, but it doesn't consume any characters itself. So it returns empty matches (source: https://www.regular-expressions.info/wordboundaries.html).
> stringr::str_view("Hello World.", "\\b")
[1] │ <>Hello<> <>World<>.
?strsplit says about the argument split:
character vector (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x.
So strsplit treats a zero-width match like the empty pattern '' and splits into single characters.
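The documented rule can be checked directly. This is a minimal sketch; the variable names are only for illustration, and the third call assumes the lookahead (?=.) behaves as an empty match at the start of the remaining string, as described in the help text:

```r
# An empty pattern, a split of length 0, and a zero-width lookahead
# all fall under the "empty matches" rule from ?strsplit.
empty_pattern <- strsplit("abc", "")[[1]]
no_pattern    <- strsplit("abc", character(0))[[1]]
lookahead     <- strsplit("abc", "(?=.)", perl = TRUE)[[1]]

print(empty_pattern)
print(no_pattern)
print(lookahead)
```

All three calls return the input split into single characters.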
To cite Hadley about \\b:
When used alone, anchors will produce a zero-width match:
> stringr::str_view("abc", c("$", "^", "\\b"))
[1] │ abc<>
[2] │ <>abc
[3] │ <>abc<>
To be more specific, let's look at the code. strsplit calls .Internal(strsplit("Hello World.", as.character("\\b"), fixed=FALSE, perl=TRUE, useBytes=FALSE)) which you can find here, and it does exactly what the documentation says.
In the PCRE (perl = TRUE) section, look at these critical lines:
/* Empty matches get the next char, so move by one. */
if (ovector[1] > 0)
    bufp += ovector[1];
else if (*bufp)
    bufp += utf8clen(*bufp); // <-- EMPTY MATCH AT THE START: ADVANCES BY ONE CHARACTER
And later in the splitting logic:
if (ovector[1] > 0) {
    /* Match was non-empty. */
    if (ovector[0] > 0)
        strncpy(pt, bufp, ovector[0]);
    pt[ovector[0]] = '\0';
    bufp += ovector[1];
} else {
    /* Match was empty. */
    int clen = utf8clen(*bufp);
    strncpy(pt, bufp, clen); // <-- EXTRACTS SINGLE CHARACTER
    pt[clen] = '\0';
    bufp += clen; // <-- ADVANCES BY ONE CHARACTER
}
The algorithm assumes that a zero-length match at the start of the remaining string means "split here and take the next character as a token". I assume that sln's pattern
> gregexpr("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)", "Hello World.", perl = TRUE)
[[1]]
[1] 6 7 12
attr(,"match.length")
[1] 0 0 0
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
which matches the same zero-length positions except the first one (1), works because at each of these positions there is a non-empty left part to split off, whereas if you also split at 1
> gregexpr("\\b", "Hello World.", perl = TRUE)
[[1]]
[1] 1 6 7 12
attr(,"match.length")
[1] 0 0 0 0
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
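there is no left part (it has zero length), which I guess then lets the function start splitting at every character: after each single-character token is removed, the search restarts on the remainder, whose first position is again a word boundary as long as it starts with a word character.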
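The loop from the C code above can be mimicked in plain R to watch this happen. This is only a sketch under assumptions: split_like_strsplit is a hypothetical helper, not part of base R, it handles a single string, and it ignores the encoding and useBytes details of the real implementation.

```r
# Hypothetical helper that imitates the perl = TRUE loop of strsplit:
# search the remainder, emit a token, drop token and match, repeat.
split_like_strsplit <- function(x, pattern) {
  tokens <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)                 # 1-based start, -1 if no match
    if (start == -1L) {
      tokens <- c(tokens, x)               # no match left: remainder is the last token
      break
    }
    end_excl <- start - 1L + attr(m, "match.length")  # corresponds to ovector[1]
    if (end_excl > 0L) {
      # Match ends past the start: the token is the text before the match
      tokens <- c(tokens, substr(x, 1L, start - 1L))
      x <- substr(x, end_excl + 1L, nchar(x))
    } else {
      # Empty match at the very start: the token is a single character
      tokens <- c(tokens, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  tokens
}

lookaround <- "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)"
sim_b    <- split_like_strsplit("Hello World.", "\\b")
sim_look <- split_like_strsplit("Hello World.", lookaround)
sim_W    <- split_like_strsplit("Hello World.", "\\W+")
print(sim_b)
print(sim_look)
print(sim_W)
```

With "\\b" the helper takes the empty-match branch for every letter, because each remainder that starts with a word character has a word boundary at its first position. With the lookaround pattern the lookbehind cannot match at the first position of a remainder, so every match has text to its left and whole tokens are split off.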
To achieve your goal, you could split on one or more whitespace characters (\\s+) or non-word characters (\\W+), like this:
strsplit("Hello World.", split = "\\s+")[[1]]
[1] "Hello" "World."
strsplit("Hello World.", split = "\\W+")[[1]]
[1] "Hello" "World"
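Another option, not from the answers here but using only base R, is to extract the words instead of splitting on what lies between them. A minimal sketch, assuming "word" means a run of \\w characters; the second input string is made up for illustration:

```r
# Extract runs of word characters rather than splitting on separators.
x <- c("Hello World.", "Split me, please!")
words <- regmatches(x, gregexpr("\\w+", x))
print(words)
```

Because nothing is split, zero-width matches never come into play, and the result is a list with one character vector per input string.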
G. Grothendieck answered in a comment. I will include it here in case the comment gets deleted:
Try
library(stringi)
stri_split_boundaries("Hello World!", type = "word", skip_word_none = TRUE)
giving
[[1]]
[1] "Hello" "World"
Probably you can try
> strsplit("Hello World.", split = "(?<=\\W)\\b\\w+|(?<=\\w\\b)\\W", perl = T)
[[1]]
[1] "Hello" "World"