 

Python regex capture various url pattern groups

Tags:

python

regex

I have a dataset containing strings like the one below, and I want to remove all the URLs from it:

http://google.com having trouble finding regex https://google.com for this case http // google com / test some gibberish https // google . com / test / test1 great http.//google.org

Currently, I am using this regex pattern to find all the URLs:

https?:?\s?\/\/\s?\S+

Ideally, it should capture all the URLs, which in this case are:

  • http://google.com

  • https://google.com

  • http // google com / test

  • https // google . com / test / test1

  • http.//google.org

but the pattern I have captures only:

  • http://google.com

  • https://google.com

  • http // google

  • https // google
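The result above can be reproduced with a short script. This is only a sketch using Python's standard `re` module and the sample string from this question; the variable names `text`, `original_pattern` and `found` are made up for illustration.

```python
import re

# Sample string from the question.
text = (
    "http://google.com having trouble finding regex https://google.com "
    "for this case http // google com / test some gibberish "
    "https // google . com / test / test1 great http.//google.org"
)

# The pattern from the question. \S+ stops at the first whitespace,
# and :? accepts a colon but not a dot after the scheme.
original_pattern = re.compile(r"https?:?\s?\/\/\s?\S+")

found = original_pattern.findall(text)
for url in found:
    print(repr(url))
```

The spaced-out URLs are cut at the first whitespace after the host, and `http.//google.org` is skipped entirely because nothing in the pattern allows a `.` between `http` and `//`.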

Link to Regex.

Tony Montana asked Dec 11 '25 12:12



1 Answer

You may use

https?[:.]?\s?\/\/(?:\s*[^\/\s.]+)+(?:\s*\.\s*[^\/\s.]+)*(?:\s*\/\s*[^\/\s]+)*

See the regex demo.

Details

  • https? - http or https
  • [:.]? - an optional : or .
  • \s? - an optional whitespace
  • \/\/ - // char sequence
  • (?:\s*[^\/\s.]+)+ - (to match the host name parts up to the first ., including parts separated by whitespace instead of dots) 1 or more occurrences of
    • \s* - 0 or more whitespaces
    • [^\/\s.]+ - 1 or more chars other than /, . and whitespace
  • (?:\s*\.\s*[^\/\s.]+)* - 0 or more sequences of
    • \s*\.\s* - a dot enclosed with 0+ whitespaces
    • [^\/\s.]+ - 1 or more chars other than /, . and whitespace
  • (?:\s*\/\s*[^\/\s]+)* - 0 or more sequences of
    • \s*\/\s* - a / enclosed with 0+ whitespaces
    • [^\/\s]+ - 1 or more chars other than / and whitespace
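As a minimal sketch of how the pattern could be used in Python, assuming the standard `re` module and the sample string from the question (the names `URL_PATTERN`, `matches` and `cleaned` are illustrative, not part of any API):

```python
import re

# Sample string from the question.
text = (
    "http://google.com having trouble finding regex https://google.com "
    "for this case http // google com / test some gibberish "
    "https // google . com / test / test1 great http.//google.org"
)

URL_PATTERN = re.compile(
    r"https?[:.]?\s?\/\/"        # scheme, optional : or ., optional space, //
    r"(?:\s*[^\/\s.]+)+"         # host name parts, possibly space-separated
    r"(?:\s*\.\s*[^\/\s.]+)*"    # dot-separated parts, spaces allowed around .
    r"(?:\s*\/\s*[^\/\s]+)*"     # path segments, spaces allowed around /
)

# All groups are non-capturing, so findall returns the full matches.
matches = URL_PATTERN.findall(text)
for url in matches:
    print(repr(url))

# Replace each URL with a space, then collapse the leftover whitespace.
cleaned = " ".join(URL_PATTERN.sub(" ", text).split())
print(cleaned)
```

One caveat: because the first group may repeat across whitespace, a URL with no dot that is followed by ordinary words (for example `http://localhost is down`) is matched in full. That is the trade-off for accepting hosts written as `google com`.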
Wiktor Stribiżew answered Dec 13 '25 03:12



