Is there a way to remove duplicate and continuous words/phrases in a string? E.g.
[in]: foo foo bar bar foo bar
[out]: foo bar foo bar
I have tried this:
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> [i for i,j in zip(s.split(),s.split()[1:]) if i!=j]
['this', 'is', 'a', 'foo', 'bar', 'black', 'sheep', ',', 'have', 'you', 'any', 'wool', 'woo', ',', 'yes', 'sir', 'yes', 'sir', 'three', 'bag', 'woo', 'wu']
>>> " ".join([i for i,j in zip(s.split(),s.split()[1:]) if i!=j]+[s.split()[-1]])
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu'
What happens when it gets a little more complicated and i want to remove phrases (let's say phrases can be made up of up to 5 words)? how can it be done? E.g.
[in]: foo bar foo bar foo bar
[out]: foo bar
Another example:
[in]: this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .
[out]: this is a sentence where phrases duplicate . sentence are not prhases .
We create an empty hash table. Then split given string around spaces. For every word, we first check if it is in hash table or not. If not found in hash table, we print it and store in the hash table.
You can use re module for that.
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar'
>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar foo bar'
If you want to match any number of consecutive occurrences:
>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
'foo bar'    
Edit. An addition for your last example. To do so you'll have to call re.sub while there're duplicate phrases. So:
>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
>>> while re.search(r'\b(.+)(\s+\1\b)+', s):
...   s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
...
>>> s
'this is a sentence where phrases duplicate'
I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes a list and groups repeated, sequential items from that list into a tuple of (item_value, iterator_of_those_values). Use it here like:
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> ' '.join(item[0] for item in groupby(s.split()))
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
So let's extend that a little with a function that returns a list with its duplicated repeated values removed:
from itertools import chain, groupby
def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))
That's great for one-word phrases, but not helpful for longer phrases. What to do? Well, first, we'll want to check for longer phrases by striding over our original phrase:
def stride(lst, offset, length):
    if offset:
        yield lst[:offset]
    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return
Now we're cooking! OK. So our strategy here is to first remove all the single-word duplicates. Next, we'll remove the two-word duplicates, starting from offset 0 then 1. After that, three-word duplicates starting at offsets 0, 1, and 2, and so on until we've hit five-word duplicates:
def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))
    return list_of_words
Putting it all together:
from itertools import chain, groupby
def stride(lst, offset, length):
    if offset:
        yield lst[:offset]
    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return
def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))
def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))
    return list_of_words
a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .'
b = 'this is a sentence where phrases duplicate . sentence are not prhases .'
print ' '.join(cleanse(a.split(), 5)) == b
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With