
Why does strsplit(x, split = "\\b", perl = TRUE) split at every letter instead of every word?

Tags:

regex

r

I need to split a text into words. I tried:

strsplit("Hello World.", split = "\\b", perl = T)

and got:

[[1]]
 [1] "H" "e" "l" "l" "o" " " "W" "o" "r" "l" "d" "."

Because I used perl = TRUE, I expected that it would split at every word, and not every letter.

Fabio Correa, asked Oct 15 '25 16:10


2 Answers

A word boundary \b is a zero-width assertion that matches at the position between a word character and a non-word character, but it doesn't consume any characters itself. So it returns empty matches (source: https://www.regular-expressions.info/wordboundaries.html).

> stringr::str_view("Hello  World.", "\\b")
[1] │ <>Hello<>  <>World<>.

The documentation (?strsplit) says this about the split argument:

character vector (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x.

So a zero-width match can be treated like the empty pattern '', which splits x into single characters.
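As a quick check of that statement, the following sketch compares splitting on the empty pattern with splitting on \\b. The variable names are only illustrative. Given the output shown in the question, the comparison should come out TRUE.

```r
x <- "Hello World."

# Splitting on the empty pattern yields single characters.
by_empty <- strsplit(x, split = "")[[1]]

# Splitting on the zero-width word boundary.
by_boundary <- strsplit(x, split = "\\b", perl = TRUE)[[1]]

# Compare the two results element by element.
same_result <- identical(by_empty, by_boundary)
print(same_result)
```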

To quote Hadley on anchors such as \\b:

When used alone, anchors will produce a zero-width match:

stringr::str_view("abc", c("$", "^", "\\b"))
[1] │ abc<>
[2] │ <>abc
[3] │ <>abc<>

To be more specific, let's look at the code. strsplit calls .Internal(strsplit("Hello World.", as.character("\\b"), fixed=FALSE, perl=TRUE, useBytes=FALSE)), whose C implementation is in the R sources, and it does exactly what the help page says. In the PCRE (perl = TRUE) branch, look at these critical lines:

/* Empty matches get the next char, so move by one. */
if (ovector[1] > 0)
    bufp += ovector[1];
else if (*bufp)
    bufp += utf8clen(*bufp);  // <-- empty match: step over one character

And later in the splitting logic:

if (ovector[1] > 0) {
    /* Match was non-empty. */
    if (ovector[0] > 0)
        strncpy(pt, bufp, ovector[0]);
    pt[ovector[0]] = '\0';
    bufp += ovector[1];
} else {
    /* Match was empty. */
    int clen = utf8clen(*bufp);
    strncpy(pt, bufp, clen);    // <-- EXTRACTS SINGLE CHARACTER
    pt[clen] = '\0';
    bufp += clen;               // <-- ADVANCES BY ONE CHARACTER
}

The algorithm assumes that a zero-length match at the start of the remaining string means "split here and take the next character as a token". I assume that sln's pattern

> gregexpr("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)", "Hello World.", perl = TRUE)
[[1]]
[1]  6  7 12
attr(,"match.length")
[1] 0 0 0
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

which matches the same zero-length positions except the first one (position 1), works because at each of these positions the split leaves a non-empty string on both sides, whereas if you split at 1

> gregexpr("\\b", "Hello World.", perl = TRUE)
[[1]]
[1]  1  6  7 12
attr(,"match.length")
[1] 0 0 0 0
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

there is no left part (it has zero length), which I guess then lets the function start splitting at every character.
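To make the loop easier to follow, here is a rough R model of the C logic quoted above. It is only a sketch under the assumption that each search restarts on the remaining string, as the bufp pointer suggests; split_sketch is a hypothetical helper, not part of R.

```r
# Hypothetical helper: mimics the quoted C loop with regexpr() on the
# remaining string. Not the real implementation, only a model of it.
split_sketch <- function(x, pattern) {
  tokens <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    if (m == -1L) {
      # No further match: the rest of the string is the last token.
      tokens <- c(tokens, x)
      break
    }
    # 0-based end offset of the match, the analogue of ovector[1].
    end_offset <- m - 1L + attr(m, "match.length")
    if (end_offset > 0L) {
      # Token is everything left of the match; drop token and match.
      tokens <- c(tokens, substr(x, 1L, m - 1L))
      x <- substr(x, end_offset + 1L, nchar(x))
    } else {
      # Empty match at the very start: emit one character and move on.
      tokens <- c(tokens, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  tokens
}

boundary_tokens <- split_sketch("Hello World.", "\\b")
lookaround_tokens <- split_sketch(
  "Hello World.", "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)"
)
print(boundary_tokens)
print(lookaround_tokens)
```

In this model, \\b matches at offset 0 of every remainder that starts with a word character, so the single-character branch fires again and again. The lookaround pattern cannot match at offset 0 because its lookbehind has nothing to look at there, so the normal branch is always taken.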

Solutions for your problem

To achieve your goal, you could split on one or more whitespace characters (\\s+) or non-word characters (\\W+), like

strsplit("Hello  World.", split = "\\s+")[[1]]
[1] "Hello"  "World."
strsplit("Hello World.", split = "\\W+")[[1]]
[1] "Hello" "World"
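One caveat with splitting on \\W+ is that a separator at the start of the string yields an empty first element, as ?strsplit notes. A possible alternative, sketched here with base R only, is to extract the runs of word characters with gregexpr and regmatches instead of splitting. The input with a leading space is just an illustrative assumption.

```r
x <- " Hello World."

# Splitting: the leading separator produces an empty first element.
split_words <- strsplit(x, split = "\\W+")[[1]]

# Extracting: collect every run of word characters instead.
found_words <- regmatches(x, gregexpr("\\w+", x))[[1]]

print(split_words)
print(found_words)
```

Which of the two fits better depends on whether the separators or the words are easier to describe with a pattern.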

G. Grothendieck answered in a comment. I will include it here in case the comment gets deleted:

Try

library(stringi)
stri_split_boundaries("Hello World!", type = "word", skip_word_none = TRUE)

giving

[[1]]
[1] "Hello" "World"

Tim G, answered Oct 18 '25 08:10


Probably you can try

> strsplit("Hello World.", split = "(?<=\\W)\\b\\w+|(?<=\\w\\b)\\W", perl = T)
[[1]]
[1] "Hello" "World"

ThomasIsCoding, answered Oct 18 '25 06:10