RegEx pattern for partial URL (switch on two values in path)

Tags: c#, regex

I have a URL pattern that needs to contain either APPLES or ORANGES in it, no other value. Optionally, it can also have query parameters. I've tried a number of RegEx patterns, but I just can't get a pattern that will respect the strict match.

Sample URLs

Good

http://www.website.com/en/pages/APPLES
http://www.website.com/en/pages/APPLES?k=v
http://www.website.com/en/pages/ORANGES?k=v&k2=v2
http://www.website.com/en/pages/ORANGES

Bad

http://www.website.com/en/pages/APPLES???k=v
http://www.website.com/en/pages/APPLES?k=v=v
http://www.website.com/en/pages/APPLESORANGES
http://www.website.com/en/pages/1APPLES
http://www.website.com/en/APPLES

Attempted RegEx Patterns (well, at least the best attempts)

(http://*.*.website*.*.com/*.*/pages(/APPLES)|(/ORANGES)[\?]*.*)
(http://*.*.website*.*.com/*.*/pages(/APPLES|/ORANGES)[\?]*.*)

If you're curious, I intentionally want to allow any sub-domain, suffix after "website" (for different environments), and any path between .com/ and /pages, hence the use of . in a number of places.

What would be the best way to achieve this?

**Edit: Final Answer**

My final answer was merged from mathematical.coffee and fardjad.

^https?://.*\.website\b.*\.com/.*/pages/(APPLES\b|ORANGES\b)((\?\w+=\w+)(&?\w+=\w+)*)?$

The single limitation I've discovered is that it will not allow a few valid characters (.~-%+) in the query string parameter key=value pairs, because \w only covers letters, digits and the underscore (see: http://en.wikipedia.org/wiki/Query_string#Structure). This isn't an issue for me as I'm matching against a string returned from .NET's Uri class, so I know the URL is well-formed overall.
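
As a quick check, the final pattern can be run against the sample URLs from the question. This is a minimal sketch using Python's re module rather than .NET (an assumption made purely for illustration; ^, $, \b, \w and .* are written the same way in both engines). The helper name is_allowed and the lists GOOD and BAD are hypothetical names, not part of the original post.

```python
import re

# The final pattern from the post, copied verbatim (split over two lines only).
FINAL = re.compile(
    r"^https?://.*\.website\b.*\.com/.*/pages/(APPLES\b|ORANGES\b)"
    r"((\?\w+=\w+)(&?\w+=\w+)*)?$"
)

# Sample URLs from the question.
GOOD = [
    "http://www.website.com/en/pages/APPLES",
    "http://www.website.com/en/pages/APPLES?k=v",
    "http://www.website.com/en/pages/ORANGES?k=v&k2=v2",
    "http://www.website.com/en/pages/ORANGES",
]
BAD = [
    "http://www.website.com/en/pages/APPLES???k=v",
    "http://www.website.com/en/pages/APPLES?k=v=v",
    "http://www.website.com/en/pages/APPLESORANGES",
    "http://www.website.com/en/pages/1APPLES",
    "http://www.website.com/en/APPLES",
]


def is_allowed(url: str) -> bool:
    """Hypothetical helper: True when the whole URL matches FINAL."""
    return FINAL.search(url) is not None


for url in GOOD + BAD:
    print(is_allowed(url), url)
```

Two things to keep in mind when reusing the pattern. A key such as my-key is rejected, which is the limitation described above. Also, because the separator is written as &?, a query like ?k=v1k2=v2 is accepted (the engine can split it as k=v1 followed by k2=v2 with no & in between); writing & without the ? would make the separator mandatory.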

Nick Tucker asked Dec 05 '25 11:12


1 Answer

I think each *.* should be .* (the * quantifier applies to whatever comes right before it, so "any characters" is written .*):

http://.*\.website\b.*\.com/.*/pages/PAGE[12](\?[^=]+=[^&=]+(&[^=]+=[^=&]+)*)?

Explanation:

http://      # just http://
.*\.         # any thing, just make sure it's followed by '.'
website\b    # website, the whole word
.*\.com      # anything between website and .com
/.*/pages/   # anything between the .com and the pages
PAGE[12]     # PAGE1 or PAGE2
(\?          # opening bracket and '?' (query string)
[^=]+        # the key: i've said it can't include =
=            # =
[^=&]+       # the value: i've said it can't include = or &
(&           # opening bracket and '&' for next part of query string
[^=]+=[^=&]+ # key=value pair, same regex as before
)*           # 0 or more of these (the &key=value)
)?           # the entire query string is optional.
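
The commented breakdown above maps directly onto a verbose-mode regex. The following is a minimal sketch in Python's re module (chosen for illustration; .NET offers a comparable RegexOptions.IgnorePatternWhitespace option). PAGE[12] is kept exactly as written in the answer; for the question's URLs that part would be (APPLES|ORANGES). The names ANSWER and BASE are hypothetical.

```python
import re

# The answer's pattern, laid out with re.VERBOSE so each part keeps its comment.
ANSWER = re.compile(
    r"""
    http://          # just http://
    .*\.             # anything, as long as it is followed by '.'
    website\b        # website, the whole word
    .*\.com          # anything between website and .com
    /.*/pages/       # anything between the .com and the pages
    PAGE[12]         # PAGE1 or PAGE2
    (\?              # opening bracket and '?' (query string)
      [^=]+          # the key: anything except '='
      =              # literal '='
      [^=&]+         # the value: anything except '=' or '&'
      (&             # opening bracket and '&' for the next pair
        [^=]+=[^=&]+ # key=value pair, same as before
      )*             # 0 or more additional pairs
    )?               # the entire query string is optional
    """,
    re.VERBOSE,
)

BASE = "http://www.website.com/en/pages/PAGE1"  # hypothetical sample URL

# fullmatch requires the whole string to match, like wrapping the pattern in ^...$.
print(bool(ANSWER.fullmatch(BASE)))
print(bool(ANSWER.fullmatch(BASE + "?k=v&k2=v2")))
print(bool(ANSWER.fullmatch(BASE + "?k=v=v")))

# search only needs a matching prefix, so the malformed tail goes unnoticed.
print(bool(ANSWER.search(BASE + "?k=v=v")))

# The key class [^=]+ also admits '?', so extra question marks slip through.
print(bool(ANSWER.fullmatch(BASE + "???k=v")))
```

The pattern as posted has no ^ or $, so whether the "bad" URLs are rejected depends on matching the whole string. Even then, ???k=v is accepted because the key is only forbidden from containing =, which is one reason the asker's final version restricts keys and values to \w+.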

NOTE - there are usually problems when you parse a query string with regex and try to make sure it's syntactically valid.

For example, in the regex I supplied above, I've said that the value in &key=value can't have an ampersand in it. But it could be an escaped entity, like &amp;, which is legal.

You'll always suffer from this sort of problem when you try to parse syntax with regex. It's a risk you'll have to take.

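Alternatively, C# can parse URLs for you (many other languages can too): the System.Uri class splits a URL into its components, and HttpUtility.ParseQueryString in System.Web parses the query string, so most of these special cases are taken care of for you.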
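
To illustrate the parser-based alternative, here is a minimal sketch that splits the URL first and then checks each piece, instead of using one large regex. It uses Python's standard urllib.parse for illustration; in C# the Uri class would play the same role. The names ALLOWED_PAGES and is_allowed are hypothetical, and the host rule (contains ".website", ends with ".com") is only an assumed reading of the asker's requirements.

```python
from urllib.parse import urlsplit

ALLOWED_PAGES = {"APPLES", "ORANGES"}  # hypothetical name for the two valid pages


def is_allowed(url: str) -> bool:
    """Hypothetical validator: split the URL, then check each component."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False

    # Assumed host rule: some sub-domain before "website", and a .com suffix.
    host = parts.hostname or ""
    if ".website" not in host or not host.endswith(".com"):
        return False

    # Path must look like /<anything>/pages/<APPLES or ORANGES>.
    segments = parts.path.split("/")
    if len(segments) < 4 or segments[-2] != "pages":
        return False
    if segments[-1] not in ALLOWED_PAGES:
        return False

    # The query string is optional; a bare trailing "?" is treated as no query.
    if not parts.query:
        return True

    # Every pair must be key=value with exactly one '=' and no stray '?'.
    for pair in parts.query.split("&"):
        key, sep, value = pair.partition("=")
        if not (key and sep and value):
            return False
        if "=" in value or "?" in pair:
            return False
    return True


print(is_allowed("http://www.website.com/en/pages/ORANGES?k=v&k2=v2"))
print(is_allowed("http://www.website.com/en/pages/APPLES???k=v"))
```

Because the pairs are split on & before they are inspected, a value that carries a percent-encoded ampersand (for example k=a%26b) passes without any special handling.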

mathematical.coffee answered Dec 06 '25 23:12


