I'm trying to write code to find the 5 words on either side of a particular phrase. Easy enough, but I have to do this on a massive volume of data, thus the code needs to be optimal!
for file in listing:
    file2 = open('/home/user/Documents/Corpus/Files/' + file, 'r')
    for line in file2:
        linetrigrams = trigram_split(line)
        for trigram in linetrigrams:
            if trigram in trigrams:
                line2 = line.replace(trigram, '###').split('###')
                window = line2[0].split()[-5:] + line2[1].split()[:5]
                for item in window:
                    if item in mostfreq:
                        matrix[trigram][mostfreq[item]] += 1
Any suggestions for doing this much faster? It might be that I'm working with entirely the wrong data structures here. trigram_split() just yields all trigrams from the line (the units I need to create vectors for). trigrams is a list of about a million trigrams that I want vectors for. window holds the 5 words preceding and the 5 words following the trigram (when that trigram is in the list); each word in the window is then looked up in mostfreq (a dictionary with 1000 words as keys, each mapped to an integer [0-100] as the stored value). That integer is used to update matrix (a dictionary whose stored values are lists, [0] * 1000): the corresponding cell in the pseudo-matrix is incremented.
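For concreteness, here is a miniature sketch of the structures described above; the names match the question but the contents are made-up stand-ins. One note worth flagging: if trigrams is a plain list, `if trigram in trigrams` scans up to a million entries on every hit, whereas storing it as a set makes that membership test O(1) on average.

```python
# Hypothetical miniature versions of the question's structures.
trigrams = {'the cat sat', 'cat sat on'}    # a set: 'in' is O(1), vs O(n) for a list
mostfreq = {'the': 0, 'cat': 1, 'mat': 2}   # frequent word -> column index
matrix = {t: [0] * 1000 for t in trigrams}  # trigram -> row of co-occurrence counts

# One update, mirroring the inner loop of the question's code:
matrix['the cat sat'][mostfreq['mat']] += 1
print(matrix['the cat sat'][2])  # -> 1
```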
Several significant factors to consider when weighing various approaches:
- What happens when the search phrase occurs more than once on a line? Given "We are the knights who say NI! NI NI NI NI NI NI NI NI", if you search for NI, what do you return? Could this happen in your data?
- What if '###' is already in your data?

You could try a regular expression:
import re
zen = """Beautiful is better than ugly. \
Explicit is better than implicit. \
Simple is better than complex. \
Complex is better than complicated. \
Flat is better than nested. \
Sparse is better than dense. \
Readability counts. \
Special cases aren't special enough to break the rules. \
Although practicality beats purity. \
Errors should never pass silently. \
Unless explicitly silenced. \
In the face of ambiguity, refuse the temptation to guess. \
There should be one-- and preferably only one --obvious way to do it. \
Although that way may not be obvious at first unless you're Dutch. \
Now is better than never. \
Although never is often better than *right* now. \
If the implementation is hard to explain, it's a bad idea. \
If the implementation is easy to explain, it may be a good idea. \
Namespaces are one honking great idea -- let's do more of those!"""
searchvar = 'Dutch'
# re.escape guards against regex metacharacters in the search phrase
dutchre = re.compile(r"((?:\S+\s*){,5})(%s)((?:\S+\s*){,5})" % re.escape(searchvar),
                     re.IGNORECASE | re.MULTILINE)
print(dutchre.findall(zen))
# [("obvious at first unless you're ", 'Dutch', '. Now is better than ')]
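To see the multiple-occurrence concern concretely, here is the same pattern shape run against the knights' line (a sketch; IGNORECASE is dropped so the "ni" inside "knights" can't match). Because regex matches never overlap, most of the NIs get swallowed into the context groups of earlier matches:

```python
import re

ni_line = "We are the knights who say NI! NI NI NI NI NI NI NI NI"
nire = re.compile(r"((?:\S+\s*){,5})(NI)((?:\S+\s*){,5})")
# Only two of the nine NIs come back as the matched phrase;
# the rest are consumed as context by the surrounding groups.
print(nire.findall(ni_line))
```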
An alternate approach, which IMO gives worse results:
def splitAndFind(text, phrase):
    text2 = text.replace(phrase, "###").split("###")
    if len(text2) > 1:
        return (text2[0].split()[-5:], text2[1].split()[:5])

print(splitAndFind(zen, 'Dutch'))
#(['obvious', 'at', 'first', 'unless', "you're"],
# ['.', 'Now', 'is', 'better', 'than'])
In IPython you can time it easily:
timeit dutchre.findall(zen)
1000 loops, best of 3: 814 us per loop
timeit 'Dutch' in zen
1000000 loops, best of 3: 650 ns per loop
timeit zen.find('Dutch')
1000000 loops, best of 3: 812 ns per loop
timeit splitAndFind(zen, 'Dutch')
10000 loops, best of 3: 18.8 us per loop
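Outside IPython, the standard-library timeit module gives comparable numbers (a sketch with a shortened test string; absolute times depend on your machine):

```python
import re
import timeit

zen = "Although that way may not be obvious at first unless you're Dutch. Now is better than never."
dutchre = re.compile(r"((?:\S+\s*){,5})(Dutch)((?:\S+\s*){,5})")

n = 10000
secs = timeit.timeit(lambda: dutchre.findall(zen), number=n)
print('findall: %.1f us per loop' % (secs / n * 1e6))
```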