The sandbox web app I'm working with is a form that receives a LinkedIn URL and uses it to pull the profile photo from that LinkedIn profile. However, the current implementation has an SSRF vulnerability: the regex we use to validate the link doesn't account for URLs in which the text after https:// is actually userinfo rather than the hostname (e.g. https://username:[email protected]). Here's the Perl code:
#!/usr/bin/perl
package External;
use strict;
use warnings;

sub isValidLinkedinProfileUrl {
    my $linkedin_url = shift;
    my $pattern = qr/^https:\/\/www\.linkedin\.com/;
    return $linkedin_url =~ $pattern;
}

1;
Let's say we give it a "malicious" URL: https://www.linkedin.com:[email protected]
Our regex would pass it, because it begins with LinkedIn's hostname. How can we fix this regex to "make sure that it takes into account URLs that might start with https://www.linkedin.com but actually go to a different host"?
I've tried adjusting the regex to use a pipe, but I don't really know how I should be adjusting it to "fix" it. I assume I should modify it to allow only LinkedIn hostnames? If I change it to /^https:\/\/www\.linkedin\.com$/, that won't work either, since it then rejects every URL that has a path. I'm not sure what else to try.
To match the URLs you're trying to match and none other:
use URI qw( );

sub is_linkedin_url {
   my $url = URI->new( shift );

   # Efficient way of handling `undef`
   # values returned for relative URLs.
   no warnings qw( uninitialized );

   return
      (  $url->scheme eq 'https'
      && lc( $url->host ) eq 'www.linkedin.com'
      && $url->port == 443
      );
}
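For comparison (my addition, not part of the answer), the same component-wise check can be sketched in Python with the standard library's urllib.parse:

```python
from urllib.parse import urlparse

def is_linkedin_url(url):
    # Parse the URL and compare its components explicitly,
    # instead of pattern-matching on the raw string.
    parts = urlparse(url)
    host = (parts.hostname or '').lower()
    port = parts.port if parts.port is not None else 443  # default HTTPS port
    return (parts.scheme == 'https'
            and host == 'www.linkedin.com'
            and port == 443)

print(is_linkedin_url('https://www.linkedin.com/in/someone/'))            # True
print(is_linkedin_url('https://www.linkedin.com:[email protected]'))  # False
```

Because the parser splits userinfo, host, and port for you, the malicious URL fails the host comparison rather than slipping past a string prefix.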
To match the profile URLs and none other:
use URI qw( );

sub is_linkedin_profile_url {
   my $url = URI->new( shift );

   # Efficient way of handling `undef`
   # values returned for relative URLs.
   no warnings qw( uninitialized );

   return
      (  $url->scheme eq 'https'
      && lc( $url->host ) eq 'www.linkedin.com'
      && $url->port == 443
      && $url->path =~ m{^/in/[^/]+/\z}
      );
}
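The profile variant can be sketched the same way in Python; the path regex mirrors the Perl m{^/in/[^/]+/\z} above:

```python
import re
from urllib.parse import urlparse

def is_linkedin_profile_url(url):
    # Validate scheme, host, and port first, then the /in/<name>/ path.
    parts = urlparse(url)
    host = (parts.hostname or '').lower()
    port = parts.port if parts.port is not None else 443
    return (parts.scheme == 'https'
            and host == 'www.linkedin.com'
            and port == 443
            and re.match(r'^/in/[^/]+/\Z', parts.path) is not None)

print(is_linkedin_profile_url('https://www.linkedin.com/in/someone/'))  # True
print(is_linkedin_profile_url('https://www.linkedin.com/feed/'))        # False
```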
I suppose that a password containing a slash would have to be URL-encoded as %2F, since a raw slash could confuse some URL parsers. If that's the case, why not just add a slash in your regex after the domain?
And I think that LinkedIn profile URLs all have a path starting with /in/..., so I would check not only that the domain is correct but also the rest of the URL, with something like this:
/^https:\/\/www\.linkedin\.com\/in\/([^#\s\/]*)/
Test it live here: https://regex101.com/r/758taj/1
Ikegami suggests also taking a potential port number in the URL into account. That's true: even if this case won't happen very often, it can happen. We could handle it by adding (?::443)? after the domain:
Test it live here: https://regex101.com/r/758taj/2
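For instance (a Python sketch; the pattern itself would be identical in Perl), the combined regex with the optional port:

```python
import re

# Profile regex with an optional :443 allowed after the domain.
pattern = re.compile(r'^https://www\.linkedin\.com(?::443)?/in/([^#\s/]*)')

print(bool(pattern.match('https://www.linkedin.com:443/in/someone/')))            # True
print(bool(pattern.match('https://www.linkedin.com:[email protected]/in/someone/')))  # False
```

The userinfo trick still fails here because after the optional `:443` the regex demands a `/`, which the `@`-containing authority cannot provide.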
Uppercase URLs would normally also be valid. One could enable the case-insensitive flag for that.
But then come percent-encoded characters, such as A, which can also be written as %41. This means that /in/ can also appear as /%69n/, /i%6e/, /%69%6e/, or even /%69%6E/.
We could replace \/in\/ with \/(?:i|%69)(?:n|%6e)\/, but isn't this just becoming a big mess?
In action: https://regex101.com/r/758taj/3
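One way a parser sidesteps this (my sketch, not from the answer): percent-decode the path before checking it. Note that blindly decoding has its own pitfall, since %2F decodes to a literal slash.

```python
import re
from urllib.parse import unquote

path_pattern = re.compile(r'^/in/[^/]+')

for path in ['/in/someone', '/%69n/someone', '/i%6e/someone', '/%69%6E/someone']:
    decoded = unquote(path)              # '/%69n/...' becomes '/in/...'
    print(path, '->', decoded, bool(path_pattern.match(decoded)))

# Caveat: unquote('/in/a%2Fb') == '/in/a/b', so decode-then-match can
# change the path structure; a real validator must treat %2F separately.
```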
And what about the domain name? We could also have encoded characters there. The pattern would get even longer and simply unmaintainable!
And what if the specifications change? Your regex will be outdated.
So this shows that a regex is a bad choice here: it becomes unreadable and will certainly fail to handle some cases.
As Ikegami and others pointed out, using a regular expression to solve this is not ideal, so a proper parser is safer. I'm not a Perl programmer, but just like PHP or JavaScript, any programming language offers a built-in or library-based URL parser.
Ikegami's answer above is the way to go!
I'll leave my answer here to show why regular expressions will cause problems: if you can use a parser, use it instead.