Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is a better way to match multiple substrings in a string than to search for each substring in a loop?

I am working with Python and I want to match a given string with multiple substrings. I have tried to solve this problem in two different ways. My first solution was to match the substring with the string like:

str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([x.upper() for x in value if x.lower() in str.lower()])
print(temp)

which results in temp = ["TEST", "MATCH", "MULTIPLE", "RING"].

However, this is not the result I would like. The substrings should have an exact match, so "ring" should not match with "string".

This is why I tried to solve this problem with regular expressions, like this:

str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([
    x.upper() for x in value
    if regex.search(
        r"\b" + regex.escape(x) + r"\b", str, regex.IGNORECASE
    ) is not None
])
print(temp)

which results in ["TEST", "MATCH", "MULTIPLE"], the correct solution.

Be that as it may, this solution takes too long to compute. I have to do this check for roughly 1 million strings and the solution using regex will take days to finish compared to the 1.5 hours it takes using the first solution.

Is there a way to either make the first solution work, or the second solution to run faster?

value can also contain numbers, or a short phrase like "test1 test2".

like image 994
jv3768 Avatar asked Oct 22 '25 05:10

jv3768


1 Answers

It's hard to suggest an optimal solution without seeing the actual data, but you can try these things:

  • Generate a single pattern matching all values. This way you would only need to search the string once (instead of once per value).
  • Skip escaping values unless they contain special characters (like '^' or '*').
  • Create the resulting list using a list comprehension instead of calling list.extend() repeatedly.
# 'str' is a built-in function, so use another name instead
string = 'A Test test string from which I want to match multiple substrings'
values = ['test', 'test2', 'Multiple', 'ring', 'match']
pattern = r'\b({})\b'.format('|'.join(map(re.escape, values)))

# unique matches, uppercased
matches = set(map(str.upper, re.findall(pattern, string, regex.IGNORECASE)))

# arrange the results as they appear in `values`
matched_values = [v for value in values if (v := value.upper()) in matches]
print(matched_values)  # ['TEST', 'MULTIPLE', 'MATCH']
like image 199
Eugene Yarmash Avatar answered Oct 23 '25 18:10

Eugene Yarmash



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!