I am currently using this regular expression in my C# / .NET Core app to parse HTTP, HTTPS and FTP URLs from a Markdown file:
static readonly Regex _urlRegex = new Regex(@"(((http|ftp|https):\/\/)+[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?)");

void UpdateGitHubReadme(string gitHubRepositoryName, string gitHubReadmeText)
{
    var updatedMarkdown = _urlRegex.Replace(gitHubReadmeText, x => HandleRegex(x.Groups[0].Value, gitHubRepositoryName.Replace(".", "").Replace("-", "").ToLower(), "github", gitHubUser.Alias));
    //handle updated markdown
}

static string HandleRegex(in string url, in string repositoryName, in string channel, in string alias)
{
    //handle url
}
I am looking to update this regex to ignore URLs inside of markdown code blocks and markdown code snippets.
The following URL should be ignored because it is inside of a code block:
` ` `
{
"name": "Brandon",
"blog" : "https://codetraveler.io"
}
` ` `
The following URL should be ignored because it is inside of a code snippet:
`curl -I https://www.keycdn.com `
You can keep your existing code, since it already passes a match evaluator as the replacement argument to Regex.Replace.
Add an alternative (using the | alternation operator) to the current regex that matches the contexts where you want to ignore URLs, and then check in the evaluator which group matched.
The alternative to add is (?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1. It matches:
(?<!`) - no backtick is allowed immediately to the left
(`(?:`{2})?) - Group 1: a backtick, then an optional sequence of two more backticks
(?:(?!\1).)*? - any character, zero or more occurrences but as few as possible, that does not start the same character sequence that is captured in Group 1 (with RegexOptions.Singleline, as in the code below, the dot also matches line breaks, so code blocks spanning several lines are covered)
\1 - the same character sequence that is captured in Group 1
Because this alternative comes first and owns Group 1, the URL pattern becomes Group 2.
See the sample code:
static readonly Regex _urlRegex = new Regex(@"(?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1|((?:ht|f)tps?://[\w-]+(?>\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)", RegexOptions.Singleline);

void UpdateGitHubReadme(string gitHubRepositoryName, string gitHubReadmeText)
{
    // Group 2 succeeds only for a URL outside code; code spans and code blocks are returned unchanged.
    var updatedMarkdown = _urlRegex.Replace(gitHubReadmeText, x => x.Groups[2].Success
        ? HandleRegex(x.Groups[0].Value, gitHubRepositoryName.Replace(".", "").Replace("-", "").ToLower(), "github", gitHubUser.Alias)
        : x.Value);
    //handle updated markdown
}
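I also modified the URL pattern a bit to make it cleaner and more efficient: the unnecessary escapes are gone, the inner groups are non-capturing, the scheme is written as (?:ht|f)tps?, and the repeated domain part uses an atomic group.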
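To try the approach in isolation, here is a minimal sketch that wraps the same regex in a small helper. The names MarkdownUrlRewriter, Rewrite and handleUrl are hypothetical and not part of the original code; handleUrl merely stands in for HandleRegex so the behaviour can be checked without the GitHub-specific arguments.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not from the original post.
static class MarkdownUrlRewriter
{
    // Alternative 1 (Group 1 = delimiter): a span delimited by one or three backticks.
    // Alternative 2 (Group 2): an http, https or ftp URL outside such a span.
    // Singleline lets the dot cross line breaks so multi-line code blocks are consumed whole.
    static readonly Regex UrlRegex = new Regex(
        @"(?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1|((?:ht|f)tps?://[\w-]+(?>\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)",
        RegexOptions.Singleline);

    // handleUrl is an assumed stand-in for HandleRegex: it receives each URL found
    // outside code and returns its replacement. Code spans and blocks are kept as-is.
    public static string Rewrite(string markdown, Func<string, string> handleUrl) =>
        UrlRegex.Replace(markdown, m => m.Groups[2].Success
            ? handleUrl(m.Groups[2].Value)
            : m.Value);
}
```

Calling MarkdownUrlRewriter.Rewrite(text, url => "<" + url + ">") should wrap only the URLs that sit outside backtick-delimited code. Note that the pattern, as written, only recognises delimiters of one or three backticks, so this sketch does not attempt to cover indented code blocks or spans delimited by two backticks.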