I'm trying to write code to find the 5 words on either side of a particular phrase. Easy enough, but I have to do this on a massive volume of data, thus the code needs to be optimal!
for file in listing:
    file2 = open('/home/user/Documents/Corpus/Files/' + file, 'r')
    for line in file2:
        linetrigrams = trigram_split(line)
        for trigram in linetrigrams:
            if trigram in trigrams:
                line2 = line.replace(trigram, '###').split('###')
                window = line2[0].split()[-5:] + line2[1].split()[:5]
                for item in window:
                    if item in mostfreq:
                        matrix[trigram][mostfreq[item]] += 1
Any suggestions for doing this much faster? It might be that I'm working with entirely the wrong data structures here. trigram_split() just yields all trigrams from the line (the units I need to create vectors for). trigrams is a list of about a million trigrams that I want vectors for. window holds the 5 words preceding and the 5 words following the trigram (when that trigram is in the list); each word in the window is then looked up in mostfreq (a dictionary with 1000 words as keys, each mapped to an integer [0-100] as the stored value). That integer is used to update matrix (a dictionary whose stored values are lists, [0] * 1000): the corresponding cell in the pseudo-matrix is incremented.
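For concreteness, here is a miniature sketch of the structures described above; the names match the question but the contents are made-up stand-ins. One note worth flagging: if trigrams is a plain list, `if trigram in trigrams` scans up to a million entries on every hit, whereas storing it as a set makes that membership test O(1) on average.

```python
# Hypothetical miniature versions of the question's structures.
trigrams = {'the cat sat', 'cat sat on'}    # a set: 'in' is O(1), vs O(n) for a list
mostfreq = {'the': 0, 'cat': 1, 'mat': 2}   # frequent word -> column index
matrix = {t: [0] * 1000 for t in trigrams}  # trigram -> row of co-occurrence counts

# One update, mirroring the inner loop of the question's code:
matrix['the cat sat'][mostfreq['mat']] += 1
print(matrix['the cat sat'][2])  # -> 1
```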
Several significant factors to consider when weighing various approaches:
- What happens when the search phrase occurs more than once on a line? Given "We are the knights who say NI! NI NI NI NI NI NI NI NI", if you search for NI, what do you return? Could this happen in your data?
- What if '###' is already in your data?

You could try a regular expression:
import re
zen = """Beautiful is better than ugly. \
Explicit is better than implicit. \
Simple is better than complex. \
Complex is better than complicated. \
Flat is better than nested. \
Sparse is better than dense. \
Readability counts. \
Special cases aren't special enough to break the rules. \
Although practicality beats purity. \
Errors should never pass silently. \
Unless explicitly silenced. \
In the face of ambiguity, refuse the temptation to guess. \
There should be one-- and preferably only one --obvious way to do it. \
Although that way may not be obvious at first unless you're Dutch. \
Now is better than never. \
Although never is often better than *right* now. \
If the implementation is hard to explain, it's a bad idea. \
If the implementation is easy to explain, it may be a good idea. \
Namespaces are one honking great idea -- let's do more of those!"""
searchvar = 'Dutch'
# re.escape guards against regex metacharacters in the search phrase
dutchre = re.compile(r"((?:\S+\s*){,5})(%s)((?:\S+\s*){,5})" % re.escape(searchvar),
                     re.IGNORECASE | re.MULTILINE)
print(dutchre.findall(zen))
# [("obvious at first unless you're ", 'Dutch', '. Now is better than ')]
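To see the multiple-occurrence concern concretely, here is the same pattern shape run against the knights' line (a sketch; IGNORECASE is dropped so the "ni" inside "knights" can't match). Because regex matches never overlap, most of the NIs get swallowed into the context groups of earlier matches:

```python
import re

ni_line = "We are the knights who say NI! NI NI NI NI NI NI NI NI"
nire = re.compile(r"((?:\S+\s*){,5})(NI)((?:\S+\s*){,5})")
# Only two of the nine NIs come back as the matched phrase;
# the rest are consumed as context by the surrounding groups.
print(nire.findall(ni_line))
```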
An alternate approach, which IMO gives worse results:
def splitAndFind(text, phrase):
    text2 = text.replace(phrase, "###").split("###")
    if len(text2) > 1:
        return (text2[0].split()[-5:], text2[1].split()[:5])

print(splitAndFind(zen, 'Dutch'))
#(['obvious', 'at', 'first', 'unless', "you're"],
# ['.', 'Now', 'is', 'better', 'than'])
In IPython you can time it easily:
timeit dutchre.findall(zen)
1000 loops, best of 3: 814 us per loop
timeit 'Dutch' in zen
1000000 loops, best of 3: 650 ns per loop
timeit zen.find('Dutch')
1000000 loops, best of 3: 812 ns per loop
timeit splitAndFind(zen, 'Dutch')
10000 loops, best of 3: 18.8 us per loop
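Outside IPython, the standard-library timeit module gives comparable numbers (a sketch with a shortened test string; absolute times depend on your machine):

```python
import re
import timeit

zen = "Although that way may not be obvious at first unless you're Dutch. Now is better than never."
dutchre = re.compile(r"((?:\S+\s*){,5})(Dutch)((?:\S+\s*){,5})")

n = 10000
secs = timeit.timeit(lambda: dutchre.findall(zen), number=n)
print('findall: %.1f us per loop' % (secs / n * 1e6))
```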