I am trying to remove words of length less than 4 from a string.
I use this regex:
re.sub(' \w{1,3} ', ' ', c)
This removes some words, but it fails when two or three words shorter than 4 characters appear in a row. For example:
I am in a bank.
It gives me:
I in bank.
How to resolve this?
Don't include the spaces; use \b word boundary anchors instead:
re.sub(r'\b\w{1,3}\b', '', c)
This removes words of up to 3 characters entirely:
>>> import re
>>> re.sub(r'\b\w{1,3}\b', '', 'The quick brown fox jumps over the lazy dog')
' quick brown  jumps over  lazy '
>>> re.sub(r'\b\w{1,3}\b', '', 'I am in a bank.')
'    bank.'
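Note that the spaces around the removed words stay behind. If that matters, one possible cleanup is to split and re-join the result. This is only a sketch; the helper name drop_short_words and its max_len parameter are made up for illustration:

```python
import re

def drop_short_words(text, max_len=3):
    # Remove whole words of 1..max_len word characters.
    stripped = re.sub(r'\b\w{1,%d}\b' % max_len, '', text)
    # split() with no arguments discards runs of whitespace,
    # so re-joining collapses the gaps left by the removed words.
    return ' '.join(stripped.split())

print(drop_short_words('I am in a bank.'))
print(drop_short_words('The quick brown fox jumps over the lazy dog'))
```

The first call prints bank. and the second prints quick brown jumps over lazy, with no leading, trailing, or doubled spaces.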
If you want an alternative to regex:
new_string = ' '.join([w for w in old_string.split() if len(w)>3])
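One thing to be aware of: split() separates on whitespace only, so punctuation counts toward a word's length here, whereas \w in the regex does not match it. A small comparison (the sample sentence and variable names are just for illustration):

```python
import re

old_string = 'The lazy dog.'

# split() keeps the full stop attached, so 'dog.' has length 4 and survives.
by_split = ' '.join([w for w in old_string.split() if len(w) > 3])

# \w does not match '.', so the regex sees the 3-letter word 'dog' and removes it.
by_regex = re.sub(r'\b\w{1,3}\b', '', old_string)

print(repr(by_split))  # 'lazy dog.'
print(repr(by_regex))  # ' lazy .'
```

Which behaviour you want depends on whether trailing punctuation should count as part of the word.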
Martijn has already answered, but I wanted to explain why your regex doesn't work. The pattern ' \w{1,3} ' matches a space, followed by 1-3 word characters, followed by another space. The I isn't matched because it has no space in front of it. The ' am ' is matched and replaced, and the regex engine then resumes at the next unconsumed character: the i of in. It cannot use the space before in, because that space was already consumed as the trailing space of the ' am ' match (the space you see there in the output comes from the replacement, which is never rescanned). So the next match it finds is ' a ', which produces your output string.
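You can see the non-overlapping matches for yourself with re.findall. The last line is just an experiment of mine, not a recommended fix: it uses a lookahead so the trailing space is checked but not consumed, which lets consecutive short words match, though the leading I is still missed because nothing precedes it.

```python
import re

text = 'I am in a bank.'

# Matches cannot overlap: ' am ' consumes the space that ' in ' would need.
matches = re.findall(r' \w{1,3} ', text)
print(matches)  # [' am ', ' a ']

original = re.sub(r' \w{1,3} ', ' ', text)
print(original)  # I in bank.

# Lookahead variant: the trailing space is asserted, not consumed,
# so ' am', ' in' and ' a' all match in turn.
with_lookahead = re.sub(r' \w{1,3}(?= )', '', text)
print(with_lookahead)  # I bank.
```

The \b anchors avoid both problems because they match a position rather than a character.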