I am trying to correct text that has some very typical scanning errors (l mistaken for I and vice versa). Basically, I would like the replacement string in re.sub to depend on the number of times the 'I' is matched, something like this:
re.sub("(\w+)(I+)(\w*)", "\g<1>l+\g<3>", "I am stiII here.")
What's the best way to achieve this?
Pass a function instead of a replacement string, as described in the docs. re.sub calls it with each match object, so your function can identify the mistake and build the best substitution based on that.
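import re

def replacement(match):
    if "I" in match.group(2):
        # One lowercase l for every misread I in the run
        return match.group(1) + "l" * len(match.group(2)) + match.group(3)
    # Add additional cases here and as ORs in your regex
    return match.group(0)  # no known mistake: leave the match unchanged

print(re.sub(r"(\w+)(II+)(\w*)", replacement, "I am stiII here."))
# Output: I am still here.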
(Note that I modified your regex so the repeated Is would appear in one group.)
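One caveat worth checking: because the leading `\w+` is greedy, a run of three or more Is can be split between the first and second groups. The sketch below compares the pattern above with a lazy variant (`\w+?`), which is my own assumption and not part of the original answer; the input `stiIII` is synthetic and only there to show the difference.

```python
import re

def replacement(match):
    # One lowercase l for every I captured in the second group.
    return match.group(1) + "l" * len(match.group(2)) + match.group(3)

GREEDY = r"(\w+)(II+)(\w*)"   # pattern from the answer above
LAZY = r"(\w+?)(II+)(\w*)"    # assumed variant: first group made lazy

sample = "stiIII"  # synthetic input with a run of three Is

greedy_result = re.sub(GREEDY, replacement, sample)
lazy_result = re.sub(LAZY, replacement, sample)
sentence_result = re.sub(LAZY, replacement, "I am stiII here.")

print(greedy_result)    # stiIll  (the first I was swallowed by group 1)
print(lazy_result)      # stilll  (the whole run lands in group 2)
print(sentence_result)  # I am still here.
```

For runs of exactly two Is, both patterns give the same result, so the lazy form only matters if longer runs can occur in your scans.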
You can use a lookaround to replace only Is followed by or preceded by another I:
import re
print(re.sub(r"(?<=I)I|I(?=I)", "l", "I am stiII here."))
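To see what the lookaround version does and does not touch, here is a small sketch. The helper names `fix_runs` and `fix_runs_counted` are hypothetical, and the second helper is an alternative I am assuming is equivalent for these inputs: it matches the whole run with `I{2,}` and builds the replacement from its length.

```python
import re

def fix_runs(text):
    # Replace any I that has another I directly before or after it.
    return re.sub(r"(?<=I)I|I(?=I)", "l", text)

def fix_runs_counted(text):
    # Alternative sketch: match a run of two or more Is and
    # emit the same number of lowercase l characters.
    return re.sub(r"I{2,}", lambda m: "l" * len(m.group()), text)

samples = ["I am stiII here.", "I wiII caII", "Henry VIII"]
results = [fix_runs(s) for s in samples]
for line in results:
    print(line)
# I am still here.
# I will call
# Henry Vlll
```

A lone I (the pronoun) is left alone, but genuine runs of capital I, such as Roman numerals, are rewritten too, so you may want to screen those out before substituting.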