Using lxml, how do you globally replace all src attributes with an absolute link?
Here is some example code that also handles href attributes (such as <a href>):
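    # Python 3 version: urljoin lives in urllib.parse (it was urlparse in Python 2)
    from urllib.parse import urljoin

    from lxml import etree, html


    def fix_links(content, absolute_prefix):
        """
        Rewrite relative links to be absolute links based on a given URL.

        @param content: HTML snippet as a string
        @param absolute_prefix: base URL that relative links are resolved against
        """
        # Accept bytes as well as text input
        if isinstance(content, bytes):
            content = content.decode("utf-8")

        content = content.strip()

        # create_parent=True wraps the snippet in a <div>, so the snippet
        # may contain several top-level elements
        tree = html.fragment_fromstring(content, create_parent=True)

        def join(base, url):
            """
            Join relative URL
            """
            if not (url.startswith("/") or "://" in url):
                return urljoin(base, url)
            else:
                # Root-relative ("/...") or already absolute: left unchanged
                return url

        # Rewrite every src attribute (img, script, iframe, ...)
        for node in tree.xpath('//*[@src]'):
            url = node.get('src')
            url = join(absolute_prefix, url)
            node.set('src', url)

        # Rewrite every href attribute (a, link, ...)
        for node in tree.xpath('//*[@href]'):
            href = node.get('href')
            url = join(absolute_prefix, href)
            node.set('href', url)

        # Serialized as UTF-8 bytes, including the wrapping <div>
        data = etree.tostring(tree, pretty_print=False, encoding="utf-8")
        return data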
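To see what the join step is doing, here is a small standalone sketch of how the standard library's urljoin resolves different kinds of URLs against a base. The base URL and paths are made-up examples, and the import assumes Python 3. It also shows what urljoin would do with a root-relative path if the helper passed it through instead of leaving it alone:

```python
from urllib.parse import urljoin

# Hypothetical base URL, used only for illustration
base = "http://localhost/foo/bar.html"

# A plain relative path is resolved against the base's directory
relative = urljoin(base, "baz.html")

# A root-relative path is resolved against the host root
root_relative = urljoin(base, "/img/logo.png")

# A URL that already has a scheme and host is returned as-is
absolute = urljoin(base, "http://example.com/x.png")

print(relative)       # http://localhost/foo/baz.html
print(root_relative)  # http://localhost/img/logo.png
print(absolute)       # http://example.com/x.png
```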
The full story is available in Plone developer documentation.
I'm not sure when this was added, but documents created with lxml.html.fromstring() now have a method called make_links_absolute. From the documentation:
make_links_absolute(base_href, resolve_base_href=True):
This makes all links in the document absolute, assuming that base_href is the URL of the document. So if you pass base_href="http://localhost/foo/bar.html" and there is a link to baz.html that will be rewritten as http://localhost/foo/baz.html.
If resolve_base_href is true, then any <base href> tag will be taken into account (just calling self.resolve_base_href()).
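A minimal sketch of that approach, assuming lxml is installed. The HTML snippet is a made-up example, and the base URL is the one from the quoted documentation:

```python
from lxml import html

# Hypothetical snippet with one relative src and one relative href
snippet = '<div><img src="logo.png"><a href="baz.html">baz</a></div>'

# lxml.html.fromstring returns an HtmlElement, which provides make_links_absolute
doc = html.fromstring(snippet)

# Resolve every link in the element (src, href, ...) against the document URL
doc.make_links_absolute("http://localhost/foo/bar.html")

# Serialize back to a text string
result = html.tostring(doc, encoding="unicode")
print(result)
```

Unlike the hand-written loop above, this rewrites every link attribute lxml knows about, so it is usually the simpler choice when no custom rules are needed.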