
Regular expression to match all punctuation except that inside of a URL

I'm looking for a regular expression to select all punctuation except for that which is inside of a URL.

If I have the string:

This is a URL: https://test.com/ThisIsAURL !

After removing all matches, it should become:

This is a URL https://test.com/ThisIsAURL

gsub("[[:punct:]]", "", x) removes all punctuation, including punctuation inside URLs. I've tried using negative lookbehinds to skip punctuation that follows https, but without success.
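For reference, a minimal reproduction of the problem with the example string above (the variable name `x` is an assumption for illustration):

```r
# Example string from the question; `x` is an assumed variable name.
x <- "This is a URL: https://test.com/ThisIsAURL !"

# Removing every punctuation character also strips "://", "." and "/"
# from the URL, which is the unwanted behaviour.
gsub("[[:punct:]]", "", x)
## => [1] "This is a URL httpstestcomThisIsAURL "
```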

In my case, all URLs are Twitter-style short links (https://t.co/). They do not end in .com, nor do they have more than one slash-separated path segment (/ThisIsAURL). Ideally, however, I'd like the regex to be as versatile as possible and to work on any URL.

Asked Oct 15 '25 03:10 by Christopher Costello



1 Answer

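You can match and capture a URL-like pattern such as https?://\S* into Group 1, match any other punctuation with a second alternative, and replace every match with a backreference to Group 1. A URL match is written back unchanged, while a punctuation match leaves Group 1 empty and is therefore removed: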

x <- "This is a URL: https://test.com/ThisIsAURL !"
trimws(gsub("(https?://\\S*)|[[:punct:]]+", "\\1", x, ignore.case=TRUE))
## => [1] "This is a URL https://test.com/ThisIsAURL"

See the R demo online.
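Since gsub() is vectorised, the same call can be applied to a whole character vector of tweets. The following is a minimal sketch; the helper name `strip_punct_keep_urls` and the sample strings are hypothetical and only serve as an illustration.

```r
# Hypothetical helper wrapping the gsub() call shown above.
strip_punct_keep_urls <- function(x) {
  # Group 1 captures a URL and is written back via \\1; any other run of
  # punctuation matches the second alternative, where \\1 is empty,
  # so it is replaced with nothing.
  trimws(gsub("(https?://\\S*)|[[:punct:]]+", "\\1", x, ignore.case = TRUE))
}

# Made-up sample input, including a t.co-style link.
tweets <- c(
  "This is a URL: https://test.com/ThisIsAURL !",
  "Wow!!! Look: https://t.co/AbC123",
  "No links here, just text."
)

strip_punct_keep_urls(tweets)
## => [1] "This is a URL https://test.com/ThisIsAURL"
## => [2] "Wow Look https://t.co/AbC123"
## => [3] "No links here just text"
```

Note that trimws() only removes leading and trailing whitespace, so a space left between two words after a punctuation mark is removed stays in place.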

The regex is

(https?://\S*)|[[:punct:]]+

See the regex demo.

Details

  • (https?://\S*) - Group 1 (referred to with \1 in the replacement pattern):
    • https?:// - https:// or http://
    • \S* - zero or more non-whitespace characters
  • | - or
  • [[:punct:]]+ - one or more punctuation characters (punctuation proper, symbols and _)
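To see which alternative is responsible for each match, the matches can be listed with regmatches() and gregexpr(). This is a small sketch for inspection only; the variable names and the second sample string are assumptions. It also illustrates one limitation: because \S* runs up to the next whitespace, punctuation glued to the end of a URL is treated as part of the URL and kept.

```r
pattern <- "(https?://\\S*)|[[:punct:]]+"

# Example string from the question.
x <- "This is a URL: https://test.com/ThisIsAURL !"

# List every match: the URL comes from Group 1, the rest from [[:punct:]]+.
regmatches(x, gregexpr(pattern, x, ignore.case = TRUE))[[1]]
## => [1] ":"                           "https://test.com/ThisIsAURL"
## => [3] "!"

# Assumed sample with a comma directly after the URL: the comma is
# consumed by \\S* and therefore survives the replacement.
y <- "See https://t.co/abc, ok?"
gsub(pattern, "\\1", y, ignore.case = TRUE)
## => [1] "See https://t.co/abc, ok"
```

If trailing punctuation after a link matters for your data, the URL part of the pattern would need to be made stricter than \S*.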
Answered Oct 18 '25 06:10 by Wiktor Stribiżew
