I have a dataset containing strings like the one below, and I want to remove all the URLs from it:
http://google.com having trouble finding regex https://google.com for this case http // google com / test some gibberish https // google . com / test / test1 great http.//google.org
I am currently using this regex pattern to find all the URLs:
https?:?\s?\/\/\s?\S+
Ideally, it should capture all the URLs, which in this case are:
http://google.com
https://google.com
http // google com / test
https // google . com / test / test1
http.//google.org
but the regex pattern I have captures only:
http://google.com
https://google.com
http // google
https // google
Link to Regex.
You may use
https?[:.]?\s?\/\/(?:\s*[^\/\s.]+)+(?:\s*\.\s*[^\/\s.]+)*(?:\s*\/\s*[^\/\s]+)*
See the regex demo.
Details
https? - http or https
[:.]? - an optional : or .
\s? - an optional whitespace
\/\/ - a // char sequence
(?:\s*[^\/\s.]+)+ - (to match all domain name parts up to the last . before the TLD) 1 or more occurrences of
    \s* - 0 or more whitespaces
    [^\/\s.]+ - 1 or more chars other than /, . and whitespace
(?:\s*\.\s*[^\/\s.]+)* - 0 or more sequences of
    \s*\.\s* - a dot enclosed with 0+ whitespaces
    [^\/\s.]+ - 1 or more chars other than /, . and whitespace
(?:\s*\/\s*[^\/\s]+)* - 0 or more sequences of
    \s*\/\s* - a / enclosed with 0+ whitespaces
    [^\/\s]+ - 1 or more chars other than / and whitespace
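
To try the pattern outside a regex tester, here is a minimal sketch in Python. The question does not say which language the dataset is processed in, so Python and the names URL_PATTERN, text, urls and cleaned are assumptions made for illustration. It lists the matches in the sample string, then removes them.

```python
import re

# The pattern from the answer, unchanged. "\/" is a valid (if unnecessary)
# escape in Python's re module, so it can be pasted as-is into a raw string.
URL_PATTERN = re.compile(
    r"https?[:.]?\s?\/\/(?:\s*[^\/\s.]+)+(?:\s*\.\s*[^\/\s.]+)*(?:\s*\/\s*[^\/\s]+)*"
)

# Sample string from the question.
text = (
    "http://google.com having trouble finding regex https://google.com "
    "for this case http // google com / test some gibberish "
    "https // google . com / test / test1 great http.//google.org"
)

# The pattern has no capturing groups, so findall returns the full matches.
urls = URL_PATTERN.findall(text)
for url in urls:
    print(repr(url))

# Remove the matches, then collapse the whitespace they leave behind.
cleaned = " ".join(URL_PATTERN.sub("", text).split())
print(cleaned)
```

One thing to watch: the first group, (?:\s*[^\/\s.]+)+, also accepts whitespace-separated words. That is what lets it match "google com" here, but if a URL written this way is followed directly by ordinary words, with no dot or slash after the host, the match can run on into that text. It is worth checking the results on a sample of the real dataset.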