
Matching if any keyword from a list is present in a string

Tags: python, regex

I have a list of keywords. A sample is:

 ['IO', 'IO Combination','CPI Combos']

What I am trying to do is check whether any of these keywords is present in a string. For example, if my string is "there is a IO competition coming in Summer 2018", it contains IO, so that keyword should be identified. If the string is "there is a competition coming in Summer 2018", no keyword should be identified.

I wrote this Python code but it also identifies IO in competition:

if any(word.lower() in string_1.lower() for word in keyword_list):
    print('FOUND A KEYWORD IN STRING')

I also want to know which keyword was found in the string (if any). What is the issue in my code, and how can I make sure that it matches only complete words?

user2916886 Avatar asked Dec 05 '25 16:12



1 Answer

Regex solution

You'll need to implement word boundaries here:

import re

keywords = ['IO', 'IO Combination','CPI Combos']

# Wrap each keyword in word boundaries; re.escape guards against regex metacharacters.
words_flat = "|".join(r'\b{}\b'.format(re.escape(word)) for word in keywords)
rx = re.compile(words_flat)

string = "there is a IO competition coming in Summer 2018"
match = rx.search(string)

if match:
    print("Found: {}".format(match.group(0)))
else:
    print("Not found")

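Here, each keyword is wrapped in \b on both sides and the results are joined with |.
You can then search with rx.search(), which prints "Found: IO" in this example. Note that alternatives are tried from left to right, so with IO listed before IO Combination, a string containing "IO Combination" is also reported as IO.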
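To report every keyword that occurs, and to prefer the longer phrase when keywords overlap, something along these lines may help. It is a minimal sketch, not part of the original answer: the helper name find_keywords and the longest-first ordering are assumptions made for illustration.

```python
import re

keywords = ['IO', 'IO Combination', 'CPI Combos']

# Assumption: try longer keywords first so 'IO Combination' wins over 'IO'.
ordered = sorted(keywords, key=len, reverse=True)
# re.escape guards against keywords that contain regex metacharacters.
rx = re.compile(r'\b(?:{})\b'.format("|".join(re.escape(k) for k in ordered)))

def find_keywords(text):
    """Return every keyword that occurs in text as a whole word or phrase."""
    return [m.group(0) for m in rx.finditer(text)]

print(find_keywords("there is a IO competition coming in Summer 2018"))
print(find_keywords("the IO Combination arm and the CPI Combos arm"))
print(find_keywords("there is a competition coming in Summer 2018"))
```

An empty list means no keyword was found, so the result can be used directly in an if statement.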


Even shorter, with the generator expression passed directly:

rx = re.compile("|".join(r'\b{}\b'.format(re.escape(word)) for word in keywords))

Non-regex solution

Note that you can also use a non-regex solution for single-word keywords: reorder your comprehension and use split(), like

found = any(word in keywords for word in string.split())

if found:
    pass  # do something here
Notes

The non-regex approach has the drawback that strings like

there is a IO. competition coming in Summer 2018
#         ---^---

won't match, because split() yields the token IO. with the period attached, whereas the regex solution still treats IO as a word. The two approaches therefore give different results. In addition, because split() breaks the string on whitespace, multi-word phrases like CPI Combos cannot be found. The regex solution can also match case-insensitively (just pass flags=re.IGNORECASE to re.compile()).
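A small comparison of the two approaches on such a string might look like this. It is only a sketch: the names rx_ci, regex_hit and split_hit are made up for the example, and the sample text uses a lowercase io to exercise the case-insensitive flag.

```python
import re

keywords = ['IO', 'IO Combination', 'CPI Combos']
pattern = "|".join(r'\b{}\b'.format(re.escape(word)) for word in keywords)
rx_ci = re.compile(pattern, flags=re.IGNORECASE)

text = "there is a io. competition coming in Summer 2018"

# The regex finds 'io' despite the trailing period and the lowercase spelling.
regex_hit = rx_ci.search(text)
# split() produces the token 'io.', which is not in the keyword list.
split_hit = any(word in keywords for word in text.split())

print(regex_hit.group(0) if regex_hit else None)
print(split_hit)
```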

It really depends on your actual requirements.

Jan Avatar answered Dec 08 '25 09:12



