
Parse URLs using Regex, Ignoring Code Blocks and Code Snippets in Markdown

Tags: c#, regex, markdown

I am currently using this regular expression in my C# / .NET Core app to parse HTTP, HTTPS and FTP URLs from a Markdown file:

static readonly Regex _urlRegex = new Regex(@"(((http|ftp|https):\/\/)+[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?)");

void UpdateGitHubReadme(string gitHubRepositoryName, string gitHubReadmeText)
{
    var updatedMarkdown = _urlRegex.Replace(gitHubReadmeText, x => HandleRegex(x.Groups[0].Value, gitHubRepositoryName.Replace(".", "").Replace("-", "").ToLower(), "github", gitHubUser.Alias));

    //handle updated markdown
}

static string HandleRegex(in string url, in string repositoryName, in string channel, in string alias)
{
    //handle url
}

I am looking to update this regex to ignore URLs inside of markdown code blocks and markdown code snippets.

Example 1

The following URL should be ignored because it is inside of a code block:

```
{ "name": "Brandon", "blog" : "https://codetraveler.io" }
```

Example 2

The following URL should be ignored because it is inside of a code snippet:

`curl -I https://www.keycdn.com`

asked Dec 09 '25 07:12

Brandon Minnick


1 Answer

You can leverage your existing code that already has a match evaluator as the replacement argument in Regex.Replace.

You need to add an alternative (with | alternation operator) to the current regex that would match the contexts where you want to ignore matches, and then check which group matched.

The alternative you should add is (?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1. It matches:

  • (?<!`) - no backtick immediately to the left is allowed
  • (`(?:`{2})?) - Group 1: a backtick and then an optional double backtick sequence
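  • (?:(?!\1).)*? - any char, zero or more occurrences but as few as possible, that does not start the same char sequence that is captured in Group 1 (with RegexOptions.Singleline, as used in the sample code, . also matches line break chars, so the match can span a multi-line code block)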
  • \1 - the same char sequence that is captured in Group 1
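
To see what this alternative consumes on its own, here is a minimal sketch written as a C# top-level program. The names codeSpan and sample are illustrative only and are not part of the original code; RegexOptions.Singleline is assumed, as in the sample code in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// The code-matching alternative by itself; Group 1 holds the opening delimiter.
// Singleline lets '.' cross line breaks so a fenced block matches as one unit.
var codeSpan = new Regex(@"(?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1", RegexOptions.Singleline);

// Illustrative input: one inline code snippet and one fenced code block.
string sample =
    "Run `curl -I https://www.keycdn.com` first.\n" +
    "```\n{ \"blog\": \"https://codetraveler.io\" }\n```\n";

foreach (Match m in codeSpan.Matches(sample))
{
    // Print the delimiter that opened the span and how many chars were consumed.
    Console.WriteLine($"delimiter: {m.Groups[1].Value}, length: {m.Length}");
}
```

Each match covers the opening delimiter, the code, and the closing delimiter, so any URL inside is consumed before the URL alternative gets a chance to match it. Note that this pattern only recognizes backtick delimiters; indented code blocks and ~~~ fences are not covered.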

See the sample code:

static readonly Regex _urlRegex = new Regex(@"(?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1|((?:ht|f)tps?://[\w-]+(?>\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)", RegexOptions.Singleline);

void UpdateGitHubReadme(string gitHubRepositoryName, string gitHubReadmeText)
{
    var updatedMarkdown = _urlRegex.Replace(gitHubReadmeText, x => x.Groups[2].Success ?
         HandleRegex(x.Groups[0].Value, gitHubRepositoryName.Replace(".", "").Replace("-", "").ToLower(), "github", gitHubUser.Alias) : x.Value);

    //handle updated markdown
}

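I also modified the URL pattern a bit to make it cleaner and more efficient: the scheme is written as (?:ht|f)tps? (which also accepts ftps), the scheme group is no longer repeated with +, the unnecessary escapes and capturing groups are removed, and the dot-separated host parts are matched with an atomic group, (?>\.[\w-]+)+.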
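
For a version that runs end to end, the following is a minimal sketch as a C# top-level program. MarkUrl is a hypothetical stand-in for HandleRegex (it just wraps the URL in angle brackets so the change is visible), and RewriteUrls and the sample text are assumptions made for illustration; they are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

// Alternative 1 consumes code snippets and code blocks; alternative 2 (Group 2) captures a URL.
var urlRegex = new Regex(
    @"(?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1|((?:ht|f)tps?://[\w-]+(?>\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)",
    RegexOptions.Singleline);

// Hypothetical replacement for HandleRegex: marks the URL so the rewrite is easy to spot.
string MarkUrl(string url) => $"<{url}>";

// Rewrite only when Group 2 matched; otherwise the match is code and is returned unchanged.
string RewriteUrls(string markdown) =>
    urlRegex.Replace(markdown, m => m.Groups[2].Success ? MarkUrl(m.Groups[2].Value) : m.Value);

// Illustrative input: a plain URL, an inline snippet, and a fenced block.
string markdown =
    "See https://example.com/docs for details.\n" +
    "`curl -I https://www.keycdn.com`\n" +
    "```\n{ \"blog\": \"https://codetraveler.io\" }\n```\n";

Console.WriteLine(RewriteUrls(markdown));
```

With this input, only https://example.com/docs is wrapped; the URLs inside the snippet and the fenced block are left as they are, because the first alternative consumes them together with their backticks.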

answered Dec 11 '25 19:12

Wiktor Stribiżew


