
How to make Java String split greedy with lookahead?

Tags: java, regex, split

Code is basically:

String[] result = "T&&T&T".split("(?=\\w|&+)");

I was expecting the lookahead to be greedy but instead it is returning the array:

T, &, &, T, &, T

What I am aiming for is:

T, &&, T, &, T

Is this possible for split and lookahead?

I have tried the following split regex values but the result is still not greedy for the ampersand:

"(?=\\w|&&?)"

"(?=\\w|&{1,2})"

radj asked Nov 15 '25 06:11


2 Answers

It is already greedy, but I think you are misunderstanding how your split works. The problem is that you are thinking of the characters but not the positions between them (this is one of the places where regexes can get away from you).

You are asking to split at the places in the string where the next character is either a word character or a series of ampersands. In your string, let's mark the places that satisfy that:

T|&|&|T|&|T

In the space between the first T and the first ampersand, the next character is an ampersand (which satisfies (?=&), one of the cases your regex allows), so that space matches; the space between the two ampersands matches for the same reason. The space between the ampersands and the second T also matches (it satisfies (?=\w)), and so on.

The split function will test each space in the string to determine if it is a candidate for a split position. To do what you want, you have to be careful about how you use the lookahead, so that it doesn't allow splits in the middle of a run of ampersands.
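To make the "spaces between characters" idea concrete, here is a minimal sketch that lists every index where the original lookahead matches. The class and method names (SplitPositions, positions) are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitPositions {
    // Hypothetical helper: collects the start index of every match.
    // For a zero-width pattern, each index is a candidate split position.
    static List<Integer> positions(String regex, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        // The lookahead from the question, applied to the question's input.
        System.out.println(positions("(?=\\w|&+)", "T&&T&T"));
        // prints [0, 1, 2, 3, 4, 5]
    }
}
```

Index 2 is the position between the two ampersands, which is why && gets cut in half. The match at index 0 does not produce a leading empty element, assuming Java 8 or later, where split ignores a zero-width match at the start of the input.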

There are multiple ways you may overcome this; Wiktor Stribiżew provides a suggestion that works in his comment.

Usually, a look-behind that checks you are not repeating an undesired character will work. Alternatively, where possible, you can use a look-behind to identify the matching places and a look-ahead to avoid the undesired repetitions. For example, to split between all characters while keeping repeated characters together, you could use (?<=(.))(?!\\1), which splits your example as T, &&, T, &, T.
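As a rough sketch of that last pattern in use (the class and method names KeepRuns and splitRuns are hypothetical, chosen only for this example):

```java
import java.util.Arrays;

public class KeepRuns {
    // Splits after any character that is NOT followed by the same character.
    // (?<=(.)) captures the previous character; (?!\1) rejects the position
    // when the next character repeats it.
    static String[] splitRuns(String input) {
        return input.split("(?<=(.))(?!\\1)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitRuns("T&&T&T")));
        // prints [T, &&, T, &, T]
        System.out.println(Arrays.toString(splitRuns("aaabbc")));
        // prints [aaa, bb, c]
    }
}
```

Note that this keeps any run of identical characters together, not just ampersands, so an input like TT&T would come back as TT, &, T. The pattern also matches at the very end of the input, but split drops the resulting trailing empty string.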

Matthew answered Nov 17 '25 22:11


Lookarounds cannot be greedy or reluctant; they just check whether the text adjoining the current location, to the left (lookbehind) or to the right (lookahead), matches the lookaround subpattern. If it does, and the lookaround is positive, the empty location is matched. If the lookaround is not anchored, each location in the string is tested against the pattern in the lookaround, even the beginning and the end. See this screenshot showing that (with your (?=\w|&&?)):

[screenshot: matches of (?=\w|&&?) at each location in T&&T&T]

Since the lookaround is a zero-width assertion and it does not consume characters, all locations (before each character and at the end) are tested. Thus, you get matches between each character.

The (?=\w|&&?) checks the first location before T: it gets matched with \w, so this location is matched (see the first |). Then comes the next location, after the first T and before the &. It is matched as it is followed by &&. Then the regex engine goes on to check the location between the first & and the second &. It is matched as there is a & after it. This way, we match up to the end. The end location is not matched as it is not followed by & or a word character.

You may restrict the pattern inside a lookaround with another lookaround to avoid matching specific locations inside the input string.

(?=\w|(?<!&)&)
      ^^^^^^

The (?<!&)& pattern will match a & that is not preceded with another &. See the regex demo.

IDEONE demo:

String[] result = "T&&T&T".split("(?=\\w|(?<!&)&)");
System.out.println(Arrays.toString(result));
// => [T, &&, T, &, T]

The lookaround solution is a generic one. For the current case, you can "shorten" the pattern to \b, which matches all locations between a non-word and a word character, and also at the start/end of the string if there is a word character there. It will also find a match at the end of the string here, though Java String#split removes trailing empty elements from the resulting array. This won't work if the alternatives (like \w and & in your regex) belong to the same type (say, both are word characters).
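A small sketch of the \b shortcut and its limitation, assuming Java 8 or later (where split does not produce a leading empty element for a zero-width match at the start). The class and method names are hypothetical, and the second input, with underscores in place of ampersands, is invented for illustration:

```java
import java.util.Arrays;

public class WordBoundarySplit {
    // Splits at every word boundary: each location where a word character
    // meets a non-word character, plus the string edges next to a word character.
    static String[] splitAtBoundaries(String input) {
        return input.split("\\b");
    }

    public static void main(String[] args) {
        // & is a non-word character, so boundaries fall between T and & runs.
        System.out.println(Arrays.toString(splitAtBoundaries("T&&T&T")));
        // prints [T, &&, T, &, T]

        // _ is a word character, just like T, so there is no boundary inside.
        System.out.println(Arrays.toString(splitAtBoundaries("T__T_T")));
        // prints [T__T_T]
    }
}
```

In the second case the only boundaries are at the very start and end of the string, so nothing is split; that is the situation where the lookaround pattern is the one to reach for.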

Wiktor Stribiżew answered Nov 17 '25 21:11