Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex re.findall() hangs - What if you cant read line by line

I have multiple files, each of which I am searching for a sequence of words.

My regex expression basically searches for a sequence where word1 is followed by word2 followed by word 3 etc.. So the expression looks like:

strings = re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)

For files below 20kb, the expression executes pretty well. However, the execution time exponentially increases for files > 20 kb and the process completely hangs for files close to 100kb. It appears (after having read previous threads) that the problem is to do with using .* in conjunction with re.DOTALL - leading to "catastrophic backtracking". The recommended solution was to provide the input file line by line instead of reading the whole file into a single memory buffer.

However, my input file is filled with random whitespace and "\n" newline characters. My word sequence is also long and occurs over multiple lines. Therefore, I need to input the whole file together into the regex expression in conjunction with re.DOTALL - otherwise a line by line search will never find my sequence.

Is there any way around it?

like image 283
shoi Avatar asked Jan 23 '26 23:01

shoi


1 Answers

If you're literally searching for the occurrence of three words, with no regex patterns in them at all, there's no need to use regexes at all – as suggested by @Bart as I wrote this answer :). Something like this might work (untested, and can probably be prettier):

with open('...') as f:
    contents = f.read()

words = ['word1', 'word2', 'word3']
matches = []
start_idx = 0
try:
    while True:
        cand = []
        for word in words:
            word_idx = contents.index(word, start_idx)
            cand.append(word_idx)
            start_idx = word_idx + len(word)
        matches.append(cand)
except ValueError:  # from index() failing
    pass

This puts the indices in matches; if you want an equivalent result to findall, you could do, say,

found = [contents[match[0]:match[-1]+len(words[-1]] for match in matches]

You could also make this kind of approach work without reading the whole file in beforehand by replacing the call to index with an equivalent function on files. I don't think the stdlib includes such a function; you'd probably have to manually use readline() and tell() or similar methods on file objects.

like image 182
Danica Avatar answered Jan 26 '26 13:01

Danica