Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group strings with values in Python

I'm working on twitter hashtags and I've already counted the number of times they appear in my csv file. My csv file look like:

GilletsJaunes, 100
Macron, 50
gilletsjaune, 20
tax, 10

Now, I would like to group together 2 terms that are close, such as "GilletsJaunes" and "gilletsjaune" using the fuzzywuzzy library. If the proximity between the 2 terms is greater than 80, then their value is added in only one of the 2 terms and the other is deleted. This would give:

GilletsJaunes, 120
Macron, 50
tax, 10

For use "fuzzywuzzy":

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzz.ratio("GiletsJaunes", "giletsjaune")
82 #output
like image 907
Steph Avatar asked May 25 '26 10:05

Steph


1 Answers

First, copy these two functions to be able to compute the argmax:

# given an iterable of pairs return the key corresponding to the greatest value
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]


# given an iterable of values return the index of the greatest value
def argmax_index(values):
    return argmax(enumerate(values))

Second, load the content of your CSV into a Python dictionary and proceed as follows:

from fuzzywuzzy import fuzz

input = {
    'GilletsJaunes': 100,
    'Macron': 50,
    'gilletsjaune': 20,
    'tax': 10,
}

threshold = 50

output = dict()
for query in input:
    references = list(output.keys()) # important: this is output.keys(), not input.keys()!
    scores = [fuzz.ratio(query, ref) for ref in references]
    if any(s > threshold for s in scores):
        best_reference = references[argmax_index(scores)]
        output[best_reference] += input[query]
    else:
        output[query] = input[query]

print(output)

{'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}

like image 55
Wok Avatar answered May 27 '26 23:05

Wok



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!