
Find and replace URLs in a block of text, return text + list of URLs

Tags:

python

I'm trying to find a way to take a block of text, replace all URLs in that text with some other text, then return the new text chunk and a list of URLs it found. Something like:

text = """This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"""
text, urls = FindURLs(text, "{{URL}}")

Should give:

text = "This is some text {{URL}} blah blah {{URL}} lol"
urls = ["www.google.com", "http://www.imgur.com/12345.jpg"]

I know this will involve some regex - I've found some seemingly good URL detection regex here: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/

I'm pretty rubbish with regex, though, so I'm finding it quite tricky to get it to do what I want in Python. The order in which the URLs are returned doesn't really matter.

Thanks :)

combatdave asked Jan 01 '26 01:01



1 Answer

The regular expression used below should be liberal enough to catch URLs without http or www.

Here's some simple Python code that performs the substitution and collects the matched URLs in a list:

import re

url_regex = re.compile(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>\[\]]+|\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\))+(?:\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\)|[^\s`!(){};:'".,<>?\[\]]))""")

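text = "This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"
matches = []  # filled in by process_match, in the order the URLs appear

def process_match(m):
    # re.sub calls this once per match; record the URL, return its replacement
    matches.append(m.group(0))
    return '{{URL}}'

new_text = url_regex.sub(process_match, text)

# print() is a function in Python 3
print(new_text)
print(matches)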
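
If you want the exact `FindURLs(text, "{{URL}}")` shape from the question, one way to sketch it is to wrap the same approach in a function, so the list of matches is local instead of a module-level global. The name `find_urls` and the `placeholder` parameter are my own choices for illustration, and the pattern is just the one from the code above, so the same caveats about what it does and doesn't match apply.

```python
import re

# Same pattern as in the code above, copied unchanged.
URL_REGEX = re.compile(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>\[\]]+|\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\))+(?:\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\)|[^\s`!(){};:'".,<>?\[\]]))""")


def find_urls(text, placeholder):
    """Return (new_text, urls): text with each URL replaced, plus the URLs found.

    Hypothetical helper name; not part of any library.
    """
    urls = []

    def collect(match):
        # Called by re.sub once per match, left to right.
        urls.append(match.group(0))
        # Returning the placeholder from a function means it is inserted
        # literally, with no backslash or group-reference processing.
        return placeholder

    new_text = URL_REGEX.sub(collect, text)
    return new_text, urls


text = "This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"
new_text, urls = find_urls(text, "{{URL}}")
print(new_text)
print(urls)
```

Each call builds its own list, so calling it repeatedly on different blocks of text won't mix results together, and the URLs come back in the order they appear in the text.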
obsoleter answered Jan 02 '26 14:01



