
Searching a list of words from a large file in python

Tags:

python

I am new to Python. I have a list of words and a very large file, and I would like to delete every line in the file that contains any word from the list.

The list of words is sorted and can be loaded at initialization time. I am trying to find the best approach to this problem; I'm doing a linear search right now and it is taking too much time.

Any suggestions?

asked Dec 31 '25 by user1524206

2 Answers

You can use set intersection to check whether the list of words and the words in a line have anything in common:

list_of_words = []
words = set(list_of_words)
with open(inputfile) as f1, open(outputfile, 'w') as f2:
    for line in f1:
        # Keep the line only if it shares no words with the set.
        if not set(line.split()).intersection(words):
            f2.write(line)
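As a runnable sketch of the approach above, here is the same filter applied to an in-memory list of lines instead of real files (the word list and sample lines are made up for illustration):

```python
# Demonstration of the set-intersection filter: a line is dropped
# if any of its whitespace-separated words appears in the set.
words = {"apple", "banana"}

lines = [
    "I like apple pie\n",
    "cherry tart is fine\n",
    "banana bread too\n",
]

kept = [line for line in lines if not words.intersection(line.split())]
print(kept)  # only the "cherry tart" line survives
```

Membership tests against a set are O(1) on average, so each line costs roughly the number of words it contains, regardless of how long the word list is.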
answered Jan 05 '26 by Ashwini Chaudhary


If the source file contains only words separated by whitespace, you can use sets:

words = set(your_words_list)
for line in infile:
    if words.isdisjoint(line.split()):
        outfile.write(line)

Note that this doesn't handle punctuation: given words = ['foo', 'bar'], a line like foo, bar,stuff won't be removed, because line.split() yields tokens like 'foo,' rather than 'foo'. To handle this, you need regular expressions:

import re

# Match any of the words between word boundaries.
rr = r'\b(%s)\b' % '|'.join(your_words_list)
for line in infile:
    if not re.search(rr, line):
        outfile.write(line)
answered Jan 06 '26 by georg


