Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python regular expression find and replace html tag with specific attribute value

Tags:

python

regex

Am trying to write a regular expression in python that would find all img tags where src attribute equal to a specific value.. I tried to write the below

   # where thm equal /public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82
   p = re.compile(r'<img.*?%s.*?>' % thm)
   print p.pattern
   print p.sub(linked_image, c)

and below what I got as an output

<img.*?/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82.*?>

<p><img src="/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82" alt=""></p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf </p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
like image 697
Mo J. Mughrabi Avatar asked Feb 01 '26 01:02

Mo J. Mughrabi


1 Answers

A solution with Regular Expressions:

I realized that the string which was inserted into thm was not escaped. So before adding it to your regular expression you need to escape all symbols with an additional signification in the Regular Expression language - here, ? and ..

I replaced ? with [?]{1} and . with \.. The resulting regular expressions matches now the test string.

import re
thm = '/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82'

all_html_code = '''<img.*?/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82.*?>

<p><img src="/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82" alt=""></p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf </p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf '''

escaped_thm = thm.replace('.', '\.').replace('?','[?]{1}')
p = re.compile(r'<img.*?src="(%s)".*?>' % escaped_thm)

test_img = '''<img src="/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82" alt="">'''
print p.match(test_img)

new_img_tag = '<img src="/python/logo.jpg" alt=""/>'
print p.sub(new_img_tag, all_html_code)

By the way, why are you looking for <img src=""...>? You could replace the src attribute directly:

escaped_thm = thm.replace('.', '\.').replace('?','[?]{1}')
p = re.compile(r'src="(%s)"' % escaped_thm)

replacement = '''src="/python/logo.jpg"'''
print p.sub(replacement, all_html_code)

Output 1

<_sre.SRE_Match object at 0x... >
<img.*?/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82.*?>

<p><img src="/python/logo.jpg" alt=""/></p><p>lksj lksdfj ... lksdjf 

Output 2

<p><img src="/python/logo.jpg" alt=""></p><p>lksj lksdfj ... lksdjf 

After asking about a proper way to escape symbols of regular expressions (Regular expression to escape regular expressions) - I can recommend an re.escape instead of the two replace methods.

Using LXML

Do you need to use regular expressions? Regular expressions for HTML can be highly problematic. See more information and wonderful post here(RegEx match open tags except XHTML self-contained tags).

I would rather use XPath like here.

like image 190
Jon Avatar answered Feb 03 '26 14:02

Jon