I have the code below to find the frequencies of two-word phrases, and I need to do the same for three-word phrases.
However, I have not been able to get the code below to work for three-word phrases.
from collections import Counter
import re
sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)
two_words = [' '.join(ws) for ws in zip(words, words[1:])]
wordscount = {w:f for w, f in Counter(two_words).most_common() if f > 1}
wordscount
{'show makes': 2, 'makes me': 2, 'I love': 2}
You can use collections.Counter on an iterable of 3-word groupings. The latter is constructed via a generator expression and list slicing.
from collections import Counter
three_words = (words[i:i+3] for i in range(len(words)-2))
counts = Counter(map(tuple, three_words))
wordscount = {' '.join(word): freq for word, freq in counts.items() if freq > 1}
print(wordscount)
{'show makes me': 2}
Notice we don't use str.join until the very end, to avoid unnecessary repeated string operations. In addition, tuple conversion is required for Counter, as dict keys must be hashable.
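To illustrate the hashability point, here is a minimal sketch using a shortened, made-up word list (the names `windows` and `list_error` are only for illustration). Passing the list slices straight to Counter fails, while the tuple versions are counted as expected:

```python
from collections import Counter

# Hypothetical short word list, just for illustration.
words = ["show", "makes", "me", "show", "makes", "me"]

# Each slice is a list of 3 consecutive words.
windows = [words[i:i + 3] for i in range(len(words) - 2)]

try:
    Counter(windows)  # lists are mutable, hence unhashable
    list_error = None
except TypeError as exc:
    list_error = exc  # "unhashable type: 'list'"

# Converting each window to a tuple makes it usable as a dict key.
counts = Counter(map(tuple, windows))
print(counts[("show", "makes", "me")])  # 2
```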
I suggest factoring the functionality out into a separate function:
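def nwise(iterable, n):
    """
    Iterate over n-grams of an iterable.
    Has a bit of an overhead compared to pairwise (although only during
    initialization), so the two functions are implemented independently.
    """
    # One independent iterator per position in the n-gram. This assumes
    # `iterable` can be iterated repeatedly (e.g. a list), not a one-shot iterator.
    iterables = [iter(iterable) for _ in range(n)]
    # Advance the k-th iterator by k items so the iterators are staggered.
    for index, it in enumerate(iterables):
        for _ in range(index):
            next(it, None)  # the default avoids a StopIteration on short inputs
    yield from zip(*iterables)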
Then you can do
two_words = [" ".join(bigram) for bigram in nwise(words, 2)]
and
three_words = [" ".join(trigram) for trigram in nwise(words, 3)]
and so on.
You can then use collections.Counter on top of that:
three_word_counts = Counter(" ".join(trigram) for trigram in nwise(words, 3))
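Putting the pieces together, one possible end-to-end sketch follows. The helper name `repeated_ngrams` and its `min_count` parameter are made up for this example, and it builds the n-grams by zipping shifted slices of the word list (a generalisation of the zip(words, words[1:]) idea from the question) rather than calling nwise:

```python
from collections import Counter
import re


def repeated_ngrams(sentence, n, min_count=2):
    """Return {phrase: frequency} for n-word phrases seen at least min_count times."""
    words = re.findall(r"\w+", sentence)
    # zip over n shifted views of the list: words, words[1:], words[2:], ...
    ngrams = zip(*(words[i:] for i in range(n)))
    counts = Counter(ngrams)  # each n-gram is a tuple, so it is hashable
    # Join into strings only once, when building the final result.
    return {" ".join(gram): freq for gram, freq in counts.items() if freq >= min_count}


sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
bigrams = repeated_ngrams(sentence, 2)
trigrams = repeated_ngrams(sentence, 3)
print(bigrams)   # {'I love': 2, 'show makes': 2, 'makes me': 2}
print(trigrams)  # {'show makes me': 2}
```

Slicing copies the list n times, which is fine for a sentence; for very long inputs the iterator-based nwise approach may be preferable.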