On a website I'm creating I'm using Python-Markdown to format news posts. To avoid issues with dead links and HTTP-content-on-HTTPS-page problems I'm requiring editors to upload all images to the site and then embed them (I'm using a markdown editor which I've patched to allow easy embedding of those images using standard markdown syntax).
However, I'd like to enforce the no-external-images policy in my code.
One way would be to write a regex that extracts image URLs from the markdown source code, or even to run the source through the markdown renderer and use a DOM parser to extract all src attributes from img tags.
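As a rough sketch of the second idea (render first, then inspect the HTML), the standard library's `html.parser` is enough to collect every `src`. The class and function names here are hypothetical, and the HTML string is assumed to be whatever the markdown renderer produced:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here, because
        # the default handle_startendtag() delegates to handle_starttag().
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_sources(html):
    """Return all image URLs found in the rendered HTML, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

Each returned URL could then be checked against the no-external-images policy. The drawback is that the check only runs after rendering, so it cannot stop the parse early.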
However, I'm curious if there's some way to hook into Python-Markdown to extract all image links or execute custom code (e.g. raising an exception if the link is external) during parsing.
One approach would be to intercept the <img> node at a lower level just after Markdown parses and constructs it:
# Written for Python 2 and Python-Markdown 2.x
import re
from markdown import Markdown
from markdown.inlinepatterns import ImagePattern, IMAGE_LINK_RE

RE_REMOTEIMG = re.compile('^(http|https):.+')

class CheckImagePattern(ImagePattern):

    def handleMatch(self, m):
        # let the stock pattern build the <img> element first
        node = ImagePattern.handleMatch(self, m)
        # check 'src' to ensure it is local
        src = node.attrib.get('src')
        if src and RE_REMOTEIMG.match(src):
            print 'ILLEGAL:', m.group(9)
            # or alternately you could raise an error immediately
            # raise ValueError("illegal remote url: %s" % m.group(9))
        return node

DATA = '''
![Alt text](/path/to/img.jpg)
![Alt text](http://remote.com/path/to/img.jpg)
'''

mk = Markdown()
# patch in the customized image pattern matcher with url checking
mk.inlinePatterns['image_link'] = CheckImagePattern(IMAGE_LINK_RE, mk)
result = mk.convert(DATA)
print result
Output:
ILLEGAL: http://remote.com/path/to/img.jpg
<p><img alt="Alt text" src="/path/to/img.jpg" />
<img alt="Alt text" src="http://remote.com/path/to/img.jpg" /></p>
Updated for Python 3 and Python-Markdown 3:
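import re
from markdown import Markdown
from markdown.inlinepatterns import ImageInlineProcessor, IMAGE_LINK_RE

RE_REMOTEIMG = re.compile('^(http|https):.+')

# Python-Markdown 3 replaced ImagePattern with ImageInlineProcessor, whose
# handleMatch() takes (m, data) and returns (element, start, end).
class CheckImageProcessor(ImageInlineProcessor):

    def handleMatch(self, m, data):
        node, start, end = super().handleMatch(m, data)
        # node is None when the text turned out not to be a complete image link
        if node is not None:
            # check 'src' to ensure it is local
            src = node.attrib.get('src')
            if src and RE_REMOTEIMG.match(src):
                print('ILLEGAL:', src)
                # or alternately you could raise an error immediately
                # raise ValueError("illegal remote url: %s" % src)
        return node, start, end

DATA = '''
![Alt text](/path/to/img.jpg)
![Alt text](http://remote.com/path/to/img.jpg)
'''

mk = Markdown()
# replace the built-in image processor with the url-checking one;
# 150 is the priority the built-in 'image_link' processor is registered with
mk.inlinePatterns.register(CheckImageProcessor(IMAGE_LINK_RE, mk), 'image_link', 150)
result = mk.convert(DATA)
print(result)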
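One caveat about the `RE_REMOTEIMG` pattern above: it is case-sensitive and only looks for an explicit `http:` or `https:` prefix, so a protocol-relative URL such as `//remote.com/img.jpg` or an upper-case `HTTP://...` would slip through. A possible stricter check, sketched with `urllib.parse.urlsplit` from the standard library, could be used instead. The function name and the `allowed_hosts` parameter are hypothetical and not part of the original answer:

```python
from urllib.parse import urlsplit


def is_external_image(src, allowed_hosts=()):
    """Return True if src points anywhere other than a site-local path.

    allowed_hosts is an assumed, optional whitelist of lower-case host
    names (for example the site's own domain) that count as local.
    """
    parts = urlsplit(src.strip())
    if parts.netloc:
        # Absolute or protocol-relative URL: local only if the host is whitelisted.
        return parts.hostname not in allowed_hosts
    # No host part: plain relative paths are local, while anything that
    # carries a scheme (data:, javascript:, ...) is treated as not allowed.
    return bool(parts.scheme)
```

Inside `handleMatch`, the condition `RE_REMOTEIMG.match(src)` could then be swapped for `is_external_image(src)`. Whether `data:` URIs should be rejected is a policy decision; this sketch rejects them because they are not uploaded files.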
Hope it's useful!