Python 3.3.2 - Finding Image Sources in HTML

I need to locate and extract image sources from an HTML file. For example, it might contain:

<image class="logo" src="http://example.site/logo.jpg">

or

<img src="http://another.example/picture.png">

I want to do this in Python, without any third-party libraries. I can use the re module, though. The program should:

  • sift through everything
  • seek out the img or image tags
  • find the src and get the attribute value (without the double quotes)

Is this possible, and if so, how can I do it? We can assume that I don't need to access the internet to do this (I have a file called website.html that contains all the html code).

EDIT: My current regular expressions are

r'<img[^>]*\ssrc="(.*?)"'

and

r'<image[^>]*\ssrc="(.*?)"'

The main problem is that the expression will pick up anything starting with img or image. For example, if there was something saying <imagesomethingrandom src="website">, it would still count that as an image (as the word image is at the start) and it would add the source.

Thanks in advance.

Rob.

Rob Alsod asked Mar 07 '26 18:03


1 Answer

Description

This expression will:

  • find all image and img tags which have a src attribute
  • ignore tags which are not image or img, like imagesomethingrandom
  • capture the value of the src attribute
  • correctly handle single-quoted, double-quoted, or unquoted attribute values
  • avoid most of the tricky edge cases that tend to trip up regular expressions when matching HTML

<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
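
To make the expression easier to follow, here is a sketch of the same pattern laid out with re.VERBOSE and commented piece by piece. The names IMAGE_TAG and sample are made up for this illustration; the pattern itself is intended to be unchanged.

```python
import re

# The expression above, spread over several lines. With re.VERBOSE, whitespace
# and "#" comments inside the pattern are ignored by the regex engine.
IMAGE_TAG = re.compile(r"""
    <ima?ge?                 # tag name, intended to be img or image
    (?=\s|>)                 # the name must end here, so imagesomethingrandom fails
    (?=                      # lookahead: locate the real src attribute
        (?: [^>=]            #   any character other than > or =
          | ='[^']*'         #   a whole single-quoted attribute value
          | ="[^"]*"         #   a whole double-quoted attribute value
          | =[^'"][^\s>]*    #   a whole unquoted attribute value
        )*?
        \ssrc=               #   src, preceded by whitespace
        (['"]?)              #   group 1: the opening quote, if there is one
        (.*?)                #   group 2: the attribute value
        \1                   #   the same quote again (or nothing)
        (?:\s|>)             #   the value ends at whitespace or at the end of the tag
    )
    (?: [^>=] | ='[^']*' | ="[^"]*" | =[^'"][^\s>]* )*   # consume every attribute
    >                        # end of the tag
""", re.VERBOSE | re.IGNORECASE | re.DOTALL)

# Hypothetical sample; the first tag hides a decoy src inside a quoted value.
sample = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src=http://another.example/NotQuoted.png>
"""

for match in IMAGE_TAG.finditer(sample):
    print(match.group(2))
```

Because quoted values are always consumed whole, the decoy src inside onmouseover is never examined on its own. One caveat: <ima?ge? also accepts the non-existent tag names imag and imge, so <im(?:g|age) would be a stricter way to write the tag name if that matters.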

Examples

Sample Text

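Note the difficult edge case in the first line: the onmouseover value contains a decoy src="NotTheDroidsYouAreLookingFor.png" and a > character, so a naive pattern would capture the wrong value or stop at the wrong place.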

<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>

Python Code

#!/usr/bin/env python3
import re

# Sample HTML. The first line hides a decoy src inside a quoted onmouseover value.
html = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
"""

# Raw triple-quoted string, so the pattern can contain both kinds of quote.
regex = r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>"""

# re.I ignores case (so <IMG ...> matches too); re.S lets . match newlines.
for index, match in enumerate(re.finditer(regex, html, re.I | re.S)):
    print(" ")
    print("[", index, "][ 0 ] : ", match.group(0))
    print("[", index, "][ 1 ] : ", match.group(1))
    print("[", index, "][ 2 ] : ", match.group(2))

Capture Groups

Group 0 gets the entire image or img tag
Group 1 gets the quote character surrounding the src attribute value, if there is one
Group 2 gets the src attribute value

[ 0 ][ 0 ] :  <img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
[ 0 ][ 1 ] :  "
[ 0 ][ 2 ] :  http://another.example/picture.png

[ 1 ][ 0 ] :  <image class="logo" src="http://example.site/logo.jpg">
[ 1 ][ 1 ] :  "
[ 1 ][ 2 ] :  http://example.site/logo.jpg

[ 2 ][ 0 ] :  <img src="http://another.example/DoubleQuoted.png">
[ 2 ][ 1 ] :  "
[ 2 ][ 2 ] :  http://another.example/DoubleQuoted.png

[ 3 ][ 0 ] :  <image src='http://another.example/SingleQuoted.png'>
[ 3 ][ 1 ] :  '
[ 3 ][ 2 ] :  http://another.example/SingleQuoted.png

[ 4 ][ 0 ] :  <img src=http://another.example/NotQuoted.png>
[ 4 ][ 1 ] :  
[ 4 ][ 2 ] :  http://another.example/NotQuoted.png
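
Since the question is about a local file called website.html, the pieces above could be wrapped up roughly as follows. This is only a sketch: the function names extract_image_sources and sources_from_file are invented here, the UTF-8 default is an assumption about the file, and the tag name is written <im(?:g|age) rather than <ima?ge? so that only img and image are accepted (a small tightening that is not part of the expression above).

```python
import re

# Same expression as in the answer, except for the stricter tag name
# im(?:g|age). Group 2 still holds the src value.
IMAGE_SRC = re.compile(
    r"""<im(?:g|age)(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_image_sources(html):
    """Return the src value of every img or image tag found in html."""
    return [match.group(2) for match in IMAGE_SRC.finditer(html)]


def sources_from_file(path, encoding="utf-8"):
    """Read an HTML file from disk and return its image sources.

    The utf-8 default is an assumption; pass the file's real encoding if known.
    """
    with open(path, encoding=encoding) as handle:
        return extract_image_sources(handle.read())
```

With this in place, sources_from_file("website.html") should return a plain list of source strings with the quotes already stripped, ready to loop over or print.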

Ro Yo Mi answered Mar 09 '26 07:03


