
Detect and edit external links

I'm aware of similar questions on SO, but since my situation is slightly different I thought it would be better to open a new question. I searched for an hour but may have missed something; if so, please forgive me.

The problem: I'm developing a feature similar to Facebook's: a user can post a text message which may contain a number of links. These may or may not be wrapped in anchor tags, and may use different protocols (http, https, ftp, ...).

I need to

  1. detect these links and perhaps attempt to retrieve them (just like Facebook). I guess this is a task for jQuery?

  2. reliably detect the external links and change them to mysite.com/external?url=thelink. This, I believe, is a task for PHP (since I can't trust input coming from the client side, right?)

Anyhow, with the links not guaranteed to be in anchor tags, it doesn't seem very reliable to use a DOM parser (or am I wrong?). I found a simple regex on the web (I'm terrible with regex, btw) which I think I can make use of by adding a lot more protocols:

$strText = preg_replace( '/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', '<a href="\0">\4</a>', $strText );  

Can some experts out there who have experience with this task please point me in the right direction?

mr1031011 asked Nov 20 '25 09:11



1 Answer

Yup, this is definitely something you want to do server-side. First off, if you're accepting user input containing HTML markup, you should be sanitizing it with a good HTML filter like HTML Purifier. (This will also make their input easier to parse for more complex markup.)

This should be doable within a single preg_replace() statement, but I'd split it into something like this:

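$hrefPattern = '/<a[^>]+?href="(.+?)".*?>/i';

$outLink = 'http://mysite.com/external?url=';

$offset = 0;
while (preg_match($hrefPattern, $text, $hrefMatches, PREG_OFFSET_CAPTURE, $offset))
{
    $hrefInner = $hrefMatches[1][0];            // the href value
    $hrefStart = $hrefMatches[1][1];            // byte offset of that value in $text
    $offset = $hrefStart + strlen($hrefInner);  // by default, resume right after this href

    // Anything containing :// counts as external, unless it already goes through the redirect.
    if (strpos($hrefInner, '://') !== false && strpos($hrefInner, $outLink) !== 0)
    {
        $externalUrl = $outLink . rawurlencode($hrefInner);
        // Replace only this occurrence; str_replace() would rewrite every copy of the URL
        // in $text, and a repeated link would then get wrapped a second time.
        $text = substr_replace($text, $externalUrl, $hrefStart, strlen($hrefInner));
        $offset = $hrefStart + strlen($externalUrl);
    }
}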

The preg_match() documentation explains the PREG_OFFSET_CAPTURE flag and the offset argument pretty well. We're basically just looking up each <a ... href=""> tag, grabbing its contents, rewriting it if it contains ://, and repeating until there are no more links left in $text. If you rewrite the link, you need to rawurlencode() the link you scraped to make sure the new link is valid.
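For what it's worth, here is a rough single-pass variant using preg_replace_callback(). Instead of just looking for ://, it decides whether a link is external by comparing its host with your own. The function name rewriteExternalLinks() and the $ownHost parameter are made up for this sketch, and it assumes double-quoted href attributes, like the pattern above:

```php
<?php
// Sketch only: rewriteExternalLinks() and $ownHost are illustrative names, not a library API.
function rewriteExternalLinks(string $text, string $ownHost, string $outLink): string
{
    // Group 1: everything up to the href value, group 2: the value, group 3: the closing quote.
    $pattern = '/(<a\b[^>]*?\bhref=")([^"]*)(")/i';

    $result = preg_replace_callback($pattern, function (array $m) use ($ownHost, $outLink) {
        // Attribute values may contain entities such as &amp;, so decode before parsing.
        $url  = html_entity_decode($m[2]);
        $host = parse_url($url, PHP_URL_HOST);

        // Relative links (no host) and links to our own host are left untouched.
        if (!is_string($host) || strcasecmp($host, $ownHost) === 0) {
            return $m[0];
        }

        return $m[1] . $outLink . rawurlencode($url) . $m[3];
    }, $text);

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}
```

You'd call it as rewriteExternalLinks($text, 'mysite.com', 'http://mysite.com/external?url='). Comparing hosts means absolute links back to your own site are not sent through the redirect, and protocol-relative links like //example.com/page are still caught.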

The way Facebook scrapes content for its link snippets is, I'd imagine, a lot more complex than that, but yes, you'd want to send an AJAX request to a PHP page that scrapes the link in question and generates whatever snippet you want. There's quite a bit more involved in that, though: you'll have to handle pages that don't exist, redirect to another page, have invalid markup, use different document types, and so on.
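To give a rough idea of the server-side half, here's a minimal sketch of what the PHP page behind that AJAX request might do. fetchPage() and extractSnippet() are names I made up. It assumes allow_url_fopen is enabled and the DOM extension is installed, and it only looks at <title> and the meta description, which is certainly far less than what Facebook does:

```php
<?php
// Sketch only: fetchPage() and extractSnippet() are hypothetical helpers, not library functions.

// Fetch a page, following a limited number of redirects. Returns null on any failure.
function fetchPage(string $url): ?string
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return null; // refuse file://, ftp:// and anything else that isn't plain web
    }

    $context = stream_context_create(['http' => [
        'timeout'         => 5, // seconds
        'follow_location' => 1,
        'max_redirects'   => 3,
    ]]);

    // Missing pages (4xx/5xx) and network errors make file_get_contents() return false.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

// Pull a title and description out of (possibly invalid) HTML.
function extractSnippet(string $html): array
{
    $snippet = ['title' => null, 'description' => null];
    if (trim($html) === '') {
        return $snippet;
    }

    $previous = libxml_use_internal_errors(true); // keep bad markup from raising warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $snippet['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $snippet['description'] = trim($meta->getAttribute('content'));
            break;
        }
    }

    return $snippet;
}
```

Since the URL comes from the user, you'd probably also want to refuse requests that resolve to internal or private addresses before fetching anything, and check the content type before handing the response to the parser.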

Hope that helps!

Chris Hepner answered Nov 22 '25 22:11
