 

Fun with PHP RegEx (preg_replace)

I have a form element that is submitted to a controller/model in an app I have built. I need to strip out any HTML that does not meet my requirements and convert the remaining HTML into a proprietary tag for the app. Could someone look at my preg_replace calls and suggest how to improve them?

$postText = $_POST['post_text'];
//Regex Functions
$p1 = '~<span class=\"atwho-view-flag atwho-view-flag-#\" c>|<span c class \"atwho-view-flag atwho-view-flag-#\">|<span c class \"atwho-view-flag atwho-view-flag-@\">|<span contenteditable=\\"false\\" class=\\"atwho-view-flag atwho-view-flag-@\\">|<span contenteditable=\\"false\\" class=\\"atwho-view-flag atwho-view-flag-#\\">|</span>|<span>|<span c>|<span contenteditable=\\"false\\">|&nbsp;|&nbsp|<br>~';
$r1 = '';
$start = preg_replace($p1, $r1, $postText);
$clean = str_replace('_','',$start);
$users = preg_replace("~(<var data-type=\"user\" class=\"userHighlight\" id=\"(.*?)\">)(.*?)(</var>)~", "<_link>$2|$3</_link> ", $clean);
$tags = preg_replace("~(<var data-type=\"tag\" class=\"tagHighlight\" id=\"(.*?)\">)#(.*?)(</var>)~", "<_link>tag://$3|#$3</_link> ", $users);
$last = preg_replace("~(^|\\s)#(\\w*[a-zA-Z_]+\\w*)~", " <_link>tag://$2|#$2</_link> ", $tags);
$spaces = preg_replace("~(^&nbsp;|&nbsp)~", " ", $last);
$divs = preg_replace("~(?:</?div>)+~", "\r\n", $spaces);
$final = preg_replace("~(<br>)~", "\r\n", $divs);

I am using a contenteditable div with the at.js library by ichord to allow hashtagging and user mentions. I essentially want to convert the following tags (as shown above):

Posted content:

<span contenteditable="false" class="atwho-view-flag atwho-view-flag-#"><var data-type="tag" class="tagHighlight" id="tag://4">#Hashtag</var><span contenteditable="false">&nbsp;<span></span></span></span>is <span contenteditable="false" class="atwho-view-flag atwho-view-flag-#"><var data-type="tag" class="tagHighlight" id="tag://2">#AnotherHashtag</var><span contenteditable="false">&nbsp;<span></span></span></span>and <span contenteditable="false" class="atwho-view-flag atwho-view-flag-@"><var data-type="user" class="userHighlight" id="user://82">A Username </var><span contenteditable="false">&nbsp;<span></span></span></span>made it so...

Hashtag:

<var data-type="tag" class="tagHighlight" id="tag://2">#AnotherHashtag </var>

User Mention:

<var data-type="user" class="userHighlight" id="user://82">A Username </var>

For the most part my PHP works, but every now and then I get spurious HTML that I don't need.

Lastly, some of the preg_replace() calls deal with carriage returns, which my contenteditable sends over as <div></div> or <br> elements. I need to preserve those carriage returns.

Hopefully I have explained it all clearly. Thanks in advance for your help.

Justin Erswell asked Jan 29 '26 18:01


2 Answers

Maybe this will help you

I assume you are only interested in the <var> tags (and in <div> and <br> for formatting purposes), so just remove all other tags with the PHP function strip_tags: strip_tags($postText, '<var><div><br>'). Using string functions instead of regular expressions is often the better choice when speed matters.

Removing all tags other than <var>, <div> and <br>, and replacing &nbsp; entities with a space

$clearedText = str_replace(
    '&nbsp;', 
    ' ', 
    strip_tags($postText, '<var><div><br>')
);

Consolidating all spaces to one after trimming trailing and leading spaces via trim(...)

$clearedText = preg_replace(
    '~\s+~',
    ' ',
    trim($clearedText)
);

Replacing all occurrences of <div></div> and <br> with a Windows line break

$clearedText = preg_replace(
    '~<div></div>|<br\s*/?>~',
    "\r\n",
    $clearedText
);

Converting <var> tags to <_link> tags

$linkText = preg_replace(
    '~<(var)[^>]*id="((?:tag|user)://\d+)"[^>]*>((?:[^<]+|<(?!/\1>))*)</\1>~',
    '<_link>\2|\3</_link>',
    $clearedText
);

Fixing <_link> tags whose content is tag://NUMBER|#HASH so that it becomes tag://HASH|#HASH

$linkText = preg_replace(
    '~(?<=tag://)\d+(\|#(\w+))~',
    '\2\1',
    $linkText
);
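The steps above can be chained into one function. The following is a minimal sketch, not a definitive implementation: the function name convertPostText and the shortened sample string are assumptions made for illustration, and the <var> replacement uses \3 (the tag content) as the link label, which is what the final tag:// fix-up step expects.

```php
<?php
// Hypothetical helper name; it only chains the steps described above.
function convertPostText(string $postText): string
{
    // Keep only <var>, <div> and <br>; turn &nbsp; entities into plain spaces.
    $text = str_replace('&nbsp;', ' ', strip_tags($postText, '<var><div><br>'));

    // Trim, then collapse every run of whitespace into a single space.
    $text = preg_replace('~\s+~', ' ', trim($text));

    // Empty <div></div> pairs and <br> tags become Windows line breaks.
    $text = preg_replace('~<div></div>|<br\s*/?>~', "\r\n", $text);

    // <var ... id="tag://4">#Hashtag</var>  ->  <_link>tag://4|#Hashtag</_link>
    $text = preg_replace(
        '~<(var)[^>]*id="((?:tag|user)://\d+)"[^>]*>((?:[^<]+|<(?!/\1>))*)</\1>~',
        '<_link>\2|\3</_link>',
        $text
    );

    // tag://4|#Hashtag  ->  tag://Hashtag|#Hashtag
    return preg_replace('~(?<=tag://)\d+(\|#(\w+))~', '\2\1', $text);
}

// Shortened version of the posted content from the question.
$sample = '<span contenteditable="false" class="atwho-view-flag atwho-view-flag-#">'
    . '<var data-type="tag" class="tagHighlight" id="tag://4">#Hashtag</var>'
    . '<span contenteditable="false">&nbsp;<span></span></span></span>is '
    . '<span contenteditable="false" class="atwho-view-flag atwho-view-flag-@">'
    . '<var data-type="user" class="userHighlight" id="user://82">A Username </var>'
    . '<span contenteditable="false">&nbsp;<span></span></span></span>made it so...';

echo convertPostText($sample), PHP_EOL;
// <_link>tag://Hashtag|#Hashtag</_link> is <_link>user://82|A Username </_link> made it so...
```

Note that the line-break step only matches an empty <div></div> pair; if the editor wraps text as <div>text</div>, a pattern such as the (?:</?div>)+ from the question would be needed instead.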

For a better understanding of the last two regular expressions:

<(var)[^>]*id="((?:tag|user)://\d+)"[^>]*>((?:[^<]+|<(?!/\1>))*)</\1>



(?<=tag://)\d+(\|#(\w+))
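As a rough aid to reading them, the same two patterns can be written with PCRE's x (extended) flag, which ignores whitespace inside the pattern and allows # comments. This is only a sketch: the variable names are made up for illustration, the literal # in the second pattern has to be escaped as \# in this mode, and the replacement again assumes \3 (the tag content) is the intended link label.

```php
<?php
// The <var> pattern, one commented part per line.
$varPattern = <<<'REGEX'
~
<(var)                          # group 1: the tag name, reused as \1 below
[^>]*                           # any attributes before id
id="((?:tag|user)://\d+)"       # group 2: the id, e.g. tag://4 or user://82
[^>]*>                          # any remaining attributes, end of opening tag
((?:[^<]+|<(?!/\1>))*)          # group 3: content up to the closing tag
</\1>                           # the matching closing tag
~x
REGEX;

// The fix-up pattern for tag links.
$tagPattern = <<<'REGEX'
~
(?<=tag://)     # only match directly after "tag://"
\d+             # the numeric id that gets dropped
(\|\#(\w+))     # group 1: "|#name", group 2: the bare name
~x
REGEX;

$input = '<var data-type="tag" class="tagHighlight" id="tag://4">#Hashtag</var>';

$step1 = preg_replace($varPattern, '<_link>\2|\3</_link>', $input);
$step2 = preg_replace($tagPattern, '\2\1', $step1);

echo $step1, PHP_EOL; // <_link>tag://4|#Hashtag</_link>
echo $step2, PHP_EOL; // <_link>tag://Hashtag|#Hashtag</_link>
```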


bukart answered Jan 31 '26 07:01


If I understand your question correctly, this will work for you:

$final = preg_replace("~(<+[A-Za-z0-9\/]+>)~", "\r\n", $divs);

This expression replaces HTML tags with a line break. Note that it only matches tags without attributes, such as <span>, </span> or <br>.
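To see what that pattern does and does not match, here is a small sketch run against a made-up input string (the sample HTML is an assumption, not taken from the question). Tags that carry attributes contain spaces and quotes, which the character class [A-Za-z0-9\/] does not allow, so they are left in place.

```php
<?php
// Hypothetical input: one tag with attributes, the rest without.
$html = '<div>line one</div><span class="x">kept</span><br>';

$result = preg_replace("~(<+[A-Za-z0-9\/]+>)~", "\r\n", $html);

// Make the inserted line breaks visible.
echo str_replace("\r\n", '[CRLF]', $result), PHP_EOL;
// [CRLF]line one[CRLF]<span class="x">kept[CRLF][CRLF]
```

Every attribute-less tag, including </span>, becomes a line break, while <span class="x"> survives untouched.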

Man Programmer answered Jan 31 '26 08:01


