The sandbox web app I'm working with is a form that receives a LinkedIn URL and uses it to pull the profile photo from that LinkedIn profile. However, the current implementation has an SSRF vulnerability: the regex we use to validate the link doesn't account for URLs in which the text after https:// is actually userinfo rather than the hostname (e.g. https://username:[email protected]). Here's the Perl code:
#!/usr/bin/perl
package External;
use strict;
use warnings;

sub isValidLinkedinProfileUrl {
    my $linkedin_url = shift;
    my $pattern = qr/^https:\/\/www\.linkedin\.com/;
    return $linkedin_url =~ $pattern;
}

1;
Let's say we give it a "malicious" URL: https://www.linkedin.com:[email protected]
Our regex would pass it, because it begins with LinkedIn's hostname. How can we fix this regex to "make sure that it takes into account URLs that might start with https://www.linkedin.com but actually go to a different host"?
I've tried adjusting the regex to use a pipe, but I don't really know how I should be adjusting it to "fix" it. I assume I should modify it to allow only LinkedIn hostnames? If I change it to /^https:\/\/www\.linkedin\.com$/, that won't work either, since it then rejects every URL that has a path. I'm not sure what else to try.
To match the URLs you're trying to match and none other:
use URI qw( );

sub is_linkedin_url {
   my $url = URI->new( shift );

   # Efficient way of handling `undef`
   # values returned for relative URLs.
   no warnings qw( uninitialized );

   return
      (  $url->scheme eq 'https'
      && lc( $url->host ) eq 'www.linkedin.com'
      && $url->port == 443
      );
}
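For comparison (my addition, not part of the answer), the same component-wise check can be sketched in Python with the standard library's urllib.parse:

```python
from urllib.parse import urlparse

def is_linkedin_url(url):
    # Parse the URL and compare its components explicitly,
    # instead of pattern-matching on the raw string.
    parts = urlparse(url)
    host = (parts.hostname or '').lower()
    port = parts.port if parts.port is not None else 443  # default HTTPS port
    return (parts.scheme == 'https'
            and host == 'www.linkedin.com'
            and port == 443)

print(is_linkedin_url('https://www.linkedin.com/in/someone/'))            # True
print(is_linkedin_url('https://www.linkedin.com:[email protected]'))  # False
```

Because the parser splits userinfo, host, and port for you, the malicious URL fails the host comparison rather than slipping past a string prefix.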
To match the profile URLs and none other:
use URI qw( );

sub is_linkedin_profile_url {
   my $url = URI->new( shift );

   # Efficient way of handling `undef`
   # values returned for relative URLs.
   no warnings qw( uninitialized );

   return
      (  $url->scheme eq 'https'
      && lc( $url->host ) eq 'www.linkedin.com'
      && $url->port == 443
      && $url->path =~ m{^/in/[^/]+/\z}
      );
}
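The profile variant can be sketched the same way in Python; the path regex mirrors the Perl m{^/in/[^/]+/\z} above:

```python
import re
from urllib.parse import urlparse

def is_linkedin_profile_url(url):
    # Validate scheme, host, and port first, then the /in/<name>/ path.
    parts = urlparse(url)
    host = (parts.hostname or '').lower()
    port = parts.port if parts.port is not None else 443
    return (parts.scheme == 'https'
            and host == 'www.linkedin.com'
            and port == 443
            and re.match(r'^/in/[^/]+/\Z', parts.path) is not None)

print(is_linkedin_profile_url('https://www.linkedin.com/in/someone/'))  # True
print(is_linkedin_profile_url('https://www.linkedin.com/feed/'))        # False
```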
I suppose that a password containing a slash would have to be URL-encoded as %2F, since a raw slash could confuse some URL parsers. If that's the case, why not just add a slash in your regex after the domain?
And I think that LinkedIn profile URLs all have a path starting with /in/..., so I would check not only that the domain is correct but also the rest of the URL, with something like this:
/^https:\/\/www\.linkedin\.com\/in\/([^#\s\/]*)/
Test it live here: https://regex101.com/r/758taj/1
Ikegami suggests also taking a potential port number in the URL into account. That's true: even if this case won't happen very often, it can happen. We could handle it by adding (?::443)? after the domain:
Test it live here: https://regex101.com/r/758taj/2
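For instance (a Python sketch; the pattern itself would be identical in Perl), the combined regex with the optional port:

```python
import re

# Profile regex with an optional :443 allowed after the domain.
pattern = re.compile(r'^https://www\.linkedin\.com(?::443)?/in/([^#\s/]*)')

print(bool(pattern.match('https://www.linkedin.com:443/in/someone/')))            # True
print(bool(pattern.match('https://www.linkedin.com:[email protected]/in/someone/')))  # False
```

The userinfo trick still fails here because after the optional `:443` the regex demands a `/`, which the `@`-containing authority cannot provide.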
Uppercase URLs would normally also be valid. One could enable the case-insensitive flag for that.
But then come percent-encoded characters, such as A, which can also be written as %41. This means that /in/ can also appear as /%69n/, /i%6e/, /%69%6e/, or even /%69%6E/.
We could replace \/in\/ with \/(?:i|%69)(?:n|%6e)\/, but isn't this just becoming a big mess?
In action: https://regex101.com/r/758taj/3
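One way a parser sidesteps this (my sketch, not from the answer): percent-decode the path before checking it. Note that blindly decoding has its own pitfall, since %2F decodes to a literal slash.

```python
import re
from urllib.parse import unquote

path_pattern = re.compile(r'^/in/[^/]+')

for path in ['/in/someone', '/%69n/someone', '/i%6e/someone', '/%69%6E/someone']:
    decoded = unquote(path)              # '/%69n/...' becomes '/in/...'
    print(path, '->', decoded, bool(path_pattern.match(decoded)))

# Caveat: unquote('/in/a%2Fb') == '/in/a/b', so decode-then-match can
# change the path structure; a real validator must treat %2F separately.
```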
And what about the domain name? We could also have encoded characters there. The pattern would get even longer and simply unmaintainable!
And what if the specifications change? Your regex will be outdated.
So this shows that a regex is a bad choice here: it becomes unreadable and will certainly fail to handle some cases.
As Ikegami and others pointed out, using a regular expression to solve this is not ideal, so a proper parser is safer. I'm not a Perl programmer, but just like PHP or JavaScript, any programming language offers a built-in or library-based URL parser.
Ikegami's answer above is the way to go!
I'll leave my answer here to show why regular expressions will cause problems: if you can use a parser, use it instead.