
How to extract markdown links with a regex?

I have some Python code that parses markdown text to extract the text inside the square brackets of a markdown link, along with the URL.

import re

# Extract []() style links
link_name = "[^]]+"
link_url = "http[s]?://[^)]+"
# Raw f-string so that \[ and \( reach the regex engine as written
markup_regex = rf'\[({link_name})]\(\s*({link_url})\s*\)'

for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
    name = match[0]
    url = match[1]
    print(url)
    # url will be https://www.wiki.com/atopic_(subtopic (the final ")" is cut off)

This fails to grab the full link because the URL pattern stops at the first closing parenthesis rather than the last one.

How can I make the regex match up to the final closing parenthesis?

Asked by James Bradbury, Oct 21 '25 05:10


1 Answer

For URLs like these you need a recursive pattern, which the standard re module does not support; the newer third-party regex module does:

import regex as re

data = """
It's very easy to make some words **bold** and other words *italic* with Markdown. 
You can even [link to Google!](http://google.com)
[a link](https://www.wiki.com/atopic_(subtopic))
"""

pattern = re.compile(r'\[([^][]+)\](\(((?:[^()]+|(?2))+)\))')

for match in pattern.finditer(data):
    description, _, url = match.groups()
    print(f"{description}: {url}")

This yields

link to Google!: http://google.com
a link: https://www.wiki.com/atopic_(subtopic)

See a demo on regex101.com.


This cryptic little beauty boils down to

\[([^][]+)\]           # capture anything between "[" and "]" into group 1
(\(                    # open group 2 and match "("
    ((?:[^()]+|(?2))+) # match anything not "(" nor ")" or recurse group 2
                       # capture the content into group 3 (the url)
\))                    # match ")" and close group 2
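
If installing the regex module is not an option, a more limited variant can be sketched with the standard re module. It assumes the URL contains at most one level of nested parentheses; the names ONE_LEVEL and extract_links_one_level are made up for this illustration and are not part of the answer above.

```python
import re

# Assumption: at most ONE level of nested parentheses inside the URL.
# Group 1 is the link text; group 2 is the URL, built from single
# non-parenthesis characters or a complete "(...)" pair.
ONE_LEVEL = re.compile(r'\[([^\[\]]+)\]\(((?:[^()]|\([^()]*\))+)\)')


def extract_links_one_level(text):
    """Return a list of (description, url) tuples found in text."""
    return ONE_LEVEL.findall(text)


links = extract_links_one_level('[a link](https://www.wiki.com/atopic_(subtopic))')
print(links)
# [('a link', 'https://www.wiki.com/atopic_(subtopic)')]
```

Deeper nesting such as a_(b_(c)) will not match correctly with this pattern; that case needs the recursive version.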

NOTE: This approach fails for URLs with unbalanced parentheses, e.g.

[some nasty description](https://google.com/()
#                                          ^^^

which some Markdown implementations may still accept. If you expect to encounter any such URLs, use a proper parser instead.
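
To make the limitation concrete, here is a rough sketch of a hand-written scanner that counts parenthesis depth using only the standard library. It is not a Markdown parser: it assumes a URL never spans a line break, and it simply skips links whose parentheses do not balance. The names LINK_START and extract_links are hypothetical.

```python
import re

# Matches the "[description](" prefix of a link; the URL is scanned by hand.
LINK_START = re.compile(r'\[([^\[\]]+)\]\(')


def extract_links(text):
    """Return (description, url) tuples for links with balanced parentheses."""
    links = []
    pos = 0
    while True:
        m = LINK_START.search(text, pos)
        if m is None:
            break
        depth = 1          # we are just past the opening "("
        i = m.end()
        while i < len(text) and depth > 0:
            ch = text[i]
            if ch == '\n':  # assumption: a URL never spans lines
                break
            if ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
            i += 1
        if depth == 0:
            # i is one past the closing ")", so drop that last character
            links.append((m.group(1), text[m.end():i - 1]))
            pos = i
        else:
            pos = m.end()   # unbalanced: skip this candidate and move on
    return links


print(extract_links('[a link](https://www.wiki.com/atopic_(subtopic))'))
# [('a link', 'https://www.wiki.com/atopic_(subtopic)')]
print(extract_links('[some nasty description](https://google.com/()'))
# []
```

Unlike a regex with a fixed nesting depth, the counter handles any depth of balanced parentheses, but the unbalanced example is still dropped rather than recovered.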

Answered by Jan, Oct 23 '25 22:10


