Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python - find occurrences of list of strings within string

I have a large string and a list of search strings and want to build a boolean list indicating whether or not each of the search strings exists in the large string. What is the fastest way to do this in Python?

Below is a toy example using a naive approach, but I think it's likely there's a more efficient way of doing this.

e.g. the example below should return [1, 1, 0] since both "hello" and "world" exist in the test string.

def check_strings(search_list, input):
output = []
for s in search_list:
    if input.find(s) > -1:
        output.append(1)
    else:
        output.append(0)
return output

search_strings = ["hello", "world", "goodbye"] test_string = "hello world" print(check_strings(search_strings, test_string))

like image 588
Danny Friar Avatar asked Dec 07 '25 08:12

Danny Friar


2 Answers

I can't say if this is the fastest, (this is still O(n*m)), but this is the way I would do it:

def check_strings(search_list, input_string):
    return [s in input_string for s in search_list]

The following program might be faster, or not. It uses a regular expression to make one pass through the input string. Note that you may you may want to use re.escape(i) in the re.findall() expression, or not, depending upon your needs.

def check_strings_re(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]

Here is a complete test program:

def check_strings(search_list, input_string):
    return [s in input_string for s in search_list]


def check_strings_re(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]


search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
assert check_strings(search_strings, test_string) == [True, True, False]
assert check_strings_re(search_strings, test_string) == [True, True, False]
like image 54
Robᵩ Avatar answered Dec 08 '25 20:12

Robᵩ


An implementation using the Aho Corasick algorithm (https://pypi.python.org/pypi/pyahocorasick/), which uses a single pass through the string:

import ahocorasick
import numpy as np

def check_strings(search_list, input):
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()

    index_list = []
    for item in A.iter(input):
        index_list.append(item[1][0])

    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
print(check_strings(search_strings, test_string))
like image 42
Danny Friar Avatar answered Dec 08 '25 20:12

Danny Friar



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!