I'm struggling with some performance issues.
The task at hand is to compute the similarity between two strings. For this I am using fuzzywuzzy:
from fuzzywuzzy import fuzz
print fuzz.ratio("string one", "string two")  # result: 80
print fuzz.ratio("string one", "string two which is significantly different")  # result: 38
So far, this is fine. The problem I'm facing is that I have two lists: one has 1,500 rows and the other several thousand. I need to compare every element of the first against every element of the second, and a simple nested for loop takes a ridiculously long time to compute.
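For clarity, this is roughly what my nested loop looks like (the list contents here are placeholders):

from fuzzywuzzy import fuzz

list_one = ["string one", "string two"]    # in reality ~1,500 rows
list_two = ["string one", "other string"]  # in reality several thousand rows

# every element of the first list against every element of the second
scores = [(a, b, fuzz.ratio(a, b)) for a in list_one for b in list_two]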
If anyone has a suggestion for how I can speed this up, it would be highly appreciated.
I've put together something of my own for you (Python 2.7):
from __future__ import division
import time
from fuzzywuzzy import fuzz

one = "different simliar"
two = "similar"

def compare(first, second):
    # Two words count as a match when they are built from the same set of
    # characters, so a transposition like "simliar" vs "similar" still matches.
    smaller, bigger = sorted([first, second], key=len)
    s_smaller = smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]
    counter = 0
    for word in s_smaller:
        if set(word) in bigger_sets:
            counter += len(word)  # weight the match by word length
    if counter:
        return counter / len(' '.join(s_bigger)) * 100  # percentage match
    return counter
start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)
start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)
print
print "<simliar or similar>/<length of bigger>*100%"
print 7/len(one)*100
print
print "Equals?"
print 7/len(one)*100 == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time
So mine is faster, but is it more accurate for you? You decide.
EDIT: Now it is not only faster, but also more accurate.
Result:
match: 41.1764705882
compare: --- 4.19616699219e-05 seconds ---
match: 50
fuzzy: --- 7.39097595215e-05 seconds ---
<simliar or similar>/<length of bigger>*100%
41.1764705882
Equals?
True
Faster than fuzzy?
True
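To apply this to the two lists from the question, you would still loop over every pair, just with the cheaper comparison. A sketch (the list contents and the threshold are placeholders):

list_one = ["different simliar", "another row"]  # your ~1,500 rows
list_two = ["similar", "something else"]         # your several thousand rows

matches = []
for a in list_one:
    for b in list_two:
        score = compare(a, b)
        if score > 40:  # arbitrary example threshold
            matches.append((a, b, score))
print matches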
Of course, if you want whole-word checking like fuzzywuzzy does, then here you go:
from __future__ import division
import time
from fuzzywuzzy import fuzz

one = "different simliar"
two = "similar"

def compare(first, second):
    smaller, bigger = sorted([first, second], key=len)
    s_smaller = smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]
    counter = 0
    for word in s_smaller:
        if set(word) in bigger_sets:
            counter += 1  # count matching words, not characters
    if counter:
        return counter / len(s_bigger) * 100  # percentage of words matched
    return counter
start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)
start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)
print
print "Equals?"
print fuzz.ratio(one, two) == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time
Result:
match: 50.0
compare: --- 7.20024108887e-05 seconds ---
match: 50
fuzzy: --- 0.000125169754028 seconds ---
Equals?
True
Faster than fuzzy?
True
If you need to count the number of times each of the statements appears, then no, there is no way I know of to get a huge speedup over the O(n^2) comparisons needed between the elements of the two lists. You might be able to avoid some string matching by using the lengths to rule out the possibility that a match could occur, but you would still have nested for loops. You would probably spend far more time optimizing it than the amount of processing time it would save you.
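If you do want to try that length trick, here is a minimal sketch of what I mean. It assumes fuzz.ratio's usual 2*M/T scoring (M = matching characters, T = combined length); the function name and the min_score cutoff are just illustrative:

from fuzzywuzzy import fuzz

def prefiltered_matches(list_one, list_two, min_score=80):
    results = []
    for a in list_one:
        for b in list_two:
            shorter, longer = sorted((len(a), len(b)))
            # Best possible ratio, if every character of the shorter string
            # matched, is 200 * shorter / (shorter + longer); skip pairs
            # whose lengths make min_score unreachable.
            if 200.0 * shorter / (shorter + longer) < min_score:
                continue
            score = fuzz.ratio(a, b)
            if score >= min_score:
                results.append((a, b, score))
    return results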