I have a Python script that uses multiple regular expressions to search through a file's content. Here's the relevant code snippet:
import re

PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
}

def readfilecontent(filepath):
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as file:
        return file.read()
filecontent = readfilecontent("path/ToFile")
for key, pattern in PATTERNS.items():
    matches = pattern.findall(filecontent)
    if matches:
        for match in matches:
            print(match)
PATTERNS is a dictionary where the key is the pattern's name, and the value is a precompiled regex object (created using re.compile).
filecontent contains the file's content as a single string.
The script works correctly but takes several minutes to process each file due to the large size of the files and the number of patterns.
Is there a way to speed this up, either by restructuring the code or using a more efficient library or approach? For example, would combining patterns, parallel processing, or other methods improve performance?
Multiprocessing will likely not speed this up substantially, since the bottleneck is likely file I/O. If the file is huge and you read it all into memory at once, the slow step may be the OS trying to find room for it all; in these cases it is usually faster to go line by line or piece by piece.
Given a massive file (i.e., larger than what fits comfortably in memory) and multiple patterns to match, you might consider using mmap and combining your regexes into one.
Example:
import re
import mmap
def regex_map_big_file(filename, pat):
    # mmap exposes the file as bytes, so compile a bytes pattern
    p = re.compile(pat.encode())
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in p.finditer(mm):
                # yield (group name, matched text) for whichever alternative matched
                yield next(k for k, v in m.groupdict().items() if v), m.group().decode()
pat=r"""(?x)
(?P<email>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b) |
(?P<phone>\b\d{3}-\d{3}-\d{4}\b) | 
(?P<date>\b\d{4}-\d{2}-\d{2}\b)"""  
# example:
for m in regex_map_big_file("big_file.txt", pat):
    print(m)
Given this file:
$ cat big_file.txt
balh blah [email protected] and his number is 415-515-3212 and last called him on 2025-12-25 
balh blah [email protected] and his number is 212-333-4444 
balh blah [email protected]  
Prints:
('email', '[email protected]')
('phone', '415-515-3212')
('date', '2025-12-25')
('email', '[email protected]')
('phone', '212-333-4444')
('email', '[email protected]')
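If you want the results grouped per pattern name (mirroring the PATTERNS-keyed loop in the question), here is a small sketch that collects the generator's (name, text) pairs into lists, reusing regex_map_big_file and pat from above:
from collections import defaultdict

found = defaultdict(list)
for name, text in regex_map_big_file("big_file.txt", pat):
    found[name].append(text)

# found["email"] now holds all e-mail matches in order, and so on
for name, values in found.items():
    print(name, values)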
Another approach: since all of your matches are single-line, you can also just loop over the file line by line:
def regex_map_big_file(filename, pat):
    p = re.compile(pat)
    with open(filename, "r") as f:
        for line in f:  # iterate lazily, one line at a time
            for m in p.finditer(line):
                # yield the (name, value) pair for whichever group matched
                yield next((k, v) for k, v in m.groupdict().items() if v)
# same usage... 
Given this file:
$ tail file.txt
line 999995
line 999996
line 999997
line 999998
line 999999
line 1000000
line 1000001
balh blah [email protected] and his number is 415-515-3212 and last called him on 2025-12-25 
balh blah [email protected] and his number is 212-333-4444 
balh blah [email protected] 
(i.e., 1 million lines followed by the short sample above)
Each of these methods gets through the file in under a second.
First, instead of compiling and running three separate patterns (which scans the file content three times), combine your regexes into a single pattern with named groups:
COMBINED_PATTERN = re.compile(r"(?P<email>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)|"
    r"(?P<phone>\b\d{3}-\d{3}-\d{4}\b)|"
    r"(?P<date>\b\d{4}-\d{2}-\d{2}\b)")
This way the file content is scanned only once instead of once per pattern.
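As a minimal sketch of consuming it (using the filecontent string from the question), Match.lastgroup gives the name of the named group that matched, so you can still tell which kind of value each match is:
for m in COMBINED_PATTERN.finditer(filecontent):
    # lastgroup is "email", "phone", or "date", depending on which alternative matched
    print(m.lastgroup, m.group())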
Beyond that, a specialized multi-pattern regex engine such as https://github.com/intel/hyperscan might be worth a look as well.
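For illustration only, here is a rough sketch with the third-party python-hyperscan bindings; it assumes the hyperscan.Database compile/scan API and scans the filecontent string from the question. Hyperscan reports match offsets to a callback rather than returning match objects, so treat this as a starting point, not a drop-in replacement:
import hyperscan  # pip install hyperscan

expressions = [
    rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    rb"\d{3}-\d{3}-\d{4}",
    rb"\d{4}-\d{2}-\d{2}",
]
names = ["email", "phone", "date"]

db = hyperscan.Database()
# HS_FLAG_SOM_LEFTMOST asks Hyperscan to report start-of-match offsets
db.compile(
    expressions=expressions,
    ids=list(range(len(expressions))),
    elements=len(expressions),
    flags=[hyperscan.HS_FLAG_SOM_LEFTMOST] * len(expressions),
)

data = filecontent.encode()  # Hyperscan scans bytes

def on_match(pat_id, start, end, flags, context):
    # the callback only receives offsets; slice the data to recover the text
    print(names[pat_id], data[start:end].decode())

db.scan(data, match_event_handler=on_match)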