
Match a group only if it differs from the other matching group

Tags: python, regex

I want to match every substring that begins with w and ends with d, using a regex.

For example, for the input worldworld it should return ('worldworld', 'world', 'world'). (Note: there are two world results, but they are different because they are at different positions in the string.)

For this purpose I ended up with this program, which uses the following regex:

import re

s = '''worldworld'''

for g in re.finditer(r'(?=(w.*d))(?=(w.*?d))', s):
    print(g.start(1), g.end(1), g[1])
    print(g.start(2), g.end(2), g[2])
    print('-' * 40)

This prints:

0 10 worldworld
0 5 world
----------------------------------------
5 10 world
5 10 world
----------------------------------------

It finds all the substrings, but some of them are duplicates (notice the starting and ending positions of the groups).

I can filter the groups afterwards by their starting and ending positions, but I'm wondering whether it can be done by changing my regex so that it only returns unique groups.

Can I change this regex so that a group only matches when it is different from the other one? If yes, how? I'm open to suggestions on how to solve this problem.
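
For reference, the post-filtering mentioned above could look roughly like the following. This is only a minimal sketch: it reuses the regex from the program above, and the names seen and unique are illustrative, not part of the original program.

```python
import re

s = 'worldworld'

seen = set()   # (start, end) spans that were already reported
unique = []    # (start, end, text) for each distinct span, in order found

for m in re.finditer(r'(?=(w.*d))(?=(w.*?d))', s):
    for group in (1, 2):
        span = m.span(group)
        if span not in seen:
            seen.add(span)
            unique.append((span[0], span[1], m[group]))

for start, end, text in unique:
    print(start, end, text)
```

Comparing spans rather than the matched text keeps the two world results apart, since they sit at different positions.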

Andrej Kesely asked Sep 20 '25 11:09


2 Answers

I don't believe this can be done with a single regexp. But it's straightforward with a nested loop:

import re
test = "wddddd"
# need to compile the tail regexp to get a version of
# `finditer` that allows specifying a start index
tailre = re.compile("(d)")
for wg in re.finditer("(w)", test):
    start = wg.start(1)
    for dg in tailre.finditer(test, wg.end(1)):
        end = dg.end(1)
        print(test[start : end], "at", (start, end))

That displays:

wd at (0, 2)
wdd at (0, 3)
wddd at (0, 4)
wdddd at (0, 5)
wddddd at (0, 6)

With

test = "worldworldworld"

instead:

world at (0, 5)
worldworld at (0, 10)
worldworldworld at (0, 15)
world at (5, 10)
worldworld at (5, 15)
world at (10, 15)
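
The same nested-loop idea can also be sketched without regular expressions, as a reusable generator. This is an illustrative variant, not the code from the answer: the function name substrings_between and its parameters are made up here, and it assumes that the two delimiters are single, distinct characters.

```python
def substrings_between(text, first="w", last="d"):
    """Yield (start, end, substring) for every substring of `text`
    that begins with `first` and ends with `last`.

    Assumes `first` and `last` are single, distinct characters.
    """
    # Index of every possible first character.
    starts = [i for i, ch in enumerate(text) if ch == first]
    # Exclusive end index for every possible last character.
    ends = [i + 1 for i, ch in enumerate(text) if ch == last]
    for start in starts:
        for end in ends:
            # Keep only end characters that lie after the start character.
            if end > start + 1:
                yield start, end, text[start:end]


for start, end, sub in substrings_between("worldworld"):
    print(sub, "at", (start, end))
```

Each (start, end) pair is produced exactly once, so no duplicate filtering is needed afterwards.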
Tim Peters answered Sep 22 '25 01:09


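One option is to keep the lazy second group but follow it with a positive lookahead for .*d, so that the second group only matches when at least one more d lies beyond its end. That guarantees it is not the same as the greedy first group. Wrapping this part in an optional non-capturing group lets the overall match still succeed, with the second group left unset, when the two would coincide: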

(?=(w.*d))(?:(?=(w.*?d)(?=.*d)))?

https://regex101.com/r/UI9ds7/2
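
A minimal sketch of how this pattern might be used from Python, assuming the re module leaves group 2 as None when the optional part does not participate in the match. The helper name unique_spans is hypothetical. Note that, like the regex in the question, this only reports the longest and the shortest match for each starting w, not every substring in between.

```python
import re

# Pattern from the answer above: group 2 is only set when a shorter
# match exists that ends before the greedy group 1 does.
pattern = re.compile(r'(?=(w.*d))(?:(?=(w.*?d)(?=.*d)))?')


def unique_spans(s):
    """Return (start, end, text) for every group that took part in a match."""
    found = []
    for m in pattern.finditer(s):
        for group in (1, 2):
            # A group that did not participate is reported as None.
            if m[group] is not None:
                found.append((m.start(group), m.end(group), m[group]))
    return found


for start, end, text in unique_spans('worldworld'):
    print(start, end, text)
```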

CertainPerformance answered Sep 21 '25 23:09