I have a form that submits to a controller/model in an app I have built. I need to strip out any HTML that does not meet my requirements and convert some of the remaining HTML into a proprietary tag used by the app. Could someone look at my preg_replace regexes and suggest how to improve them?
$postText = $_POST['post_text'];
//Regex Functions
$p1 = '~<span class=\"atwho-view-flag atwho-view-flag-#\" c>|<span c class \"atwho-view-flag atwho-view-flag-#\">|<span c class \"atwho-view-flag atwho-view-flag-@\">|<span contenteditable=\\"false\\" class=\\"atwho-view-flag atwho-view-flag-@\\">|<span contenteditable=\\"false\\" class=\\"atwho-view-flag atwho-view-flag-#\\">|</span>|<span>|<span c>|<span contenteditable=\\"false\\">| | |<br>~';
$r1 = '';
$start = preg_replace($p1, $r1, $postText);
$clean = str_replace('_','',$start);
$users = preg_replace("~(<var data-type=\"user\" class=\"userHighlight\" id=\"(.*?)\">)(.*?)(</var>)~", "<_link>$2|$3</_link> ", $clean);
$tags = preg_replace("~(<var data-type=\"tag\" class=\"tagHighlight\" id=\"(.*?)\">)#(.*?)(</var>)~", "<_link>tag://$3|#$3</_link> ", $users);
$last = preg_replace("~(^|\\s)#(\\w*[a-zA-Z_]+\\w*)~", " <_link>tag://$2|#$2</_link> ", $tags);
$spaces = preg_replace("~(^ | )~", " ", $last);
$divs = preg_replace("~(?:</?div>)+~", "\r\n", $spaces);
$final = preg_replace("~(<br>)~", "\r\n", $divs);
I am using a contenteditable div with the at.js library by ichord to allow hashtagging and user mentions. Essentially, I want to convert the following tags (as shown above):
Posted content:
<span contenteditable="false" class="atwho-view-flag atwho-view-flag-#"><var data-type="tag" class="tagHighlight" id="tag://4">#Hashtag</var><span contenteditable="false"> <span></span></span></span>is <span contenteditable="false" class="atwho-view-flag atwho-view-flag-#"><var data-type="tag" class="tagHighlight" id="tag://2">#AnotherHashtag</var><span contenteditable="false"> <span></span></span></span>and <span contenteditable="false" class="atwho-view-flag atwho-view-flag-@"><var data-type="user" class="userHighlight" id="user://82">A Username </var><span contenteditable="false"> <span></span></span></span>made it so...
Hashtag:
<var data-type="tag" class="tagHighlight" id="tag://2">#AnotherHashtag </var>
User Mention:
<var data-type="user" class="userHighlight" id="user://82">A Username </var>
For the most part my PHP is working, but every now and then I get spurious HTML which I don't need.
Lastly, there are some other elements in the preg_replace() calls that deal with carriage returns. My contenteditable sends these as <div></div> or <br> elements, and I need to preserve them as carriage returns.
Hopefully I have explained it all as clearly as possible, thanks in advance for your help.
Maybe this will help you.
I assume you are only interested in the <var> tags (and in <div> and <br> for formatting purposes), so just remove all other tags with the PHP function strip_tags: strip_tags($postText, '<var><div><br>'). Using string functions instead of regular expressions is often the better choice when speed matters.
Removing all tags other than <var>, <div> and <br>, and replacing &nbsp; entities with a space
$clearedText = str_replace(
'&nbsp;',
' ',
strip_tags($postText, '<var><div><br>')
);
Consolidating all spaces to one after trimming trailing and leading spaces via trim(...)
$clearedText = preg_replace(
'~\s+~',
' ',
trim($clearedText)
);
Replacing all occurrences of <div></div> and <br> with a Windows line break
$clearedText = preg_replace(
'~<div></div>|<br\s*/?>~',
"\r\n",
$clearedText
);
Converting <var> tags to <_link> tags
$linkText = preg_replace(
'~<(var)[^>]*id="((?:tag|user)://\d+)"[^>]*>((?:[^<]+|<(?!/\1>))*)</\1>~',
'<_link>\2|\3</_link>',
$clearedText
);
Fixing the content of <_link> tags: tag://NUMBER|#HASH becomes tag://HASH|#HASH
$linkText = preg_replace(
'~(?<=tag://)\d+(\|#(\w+))~',
'\2\1',
$linkText
);
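The steps above can be combined into a single function. This is only a sketch: the function name convertPostText is hypothetical, the sample input is shortened from the question, and it assumes the link replacement uses group 2 (the id) and group 3 (the tag content).

```php
<?php
// Hypothetical helper that chains the five steps described above.
function convertPostText(string $postText): string
{
    // 1. Keep only <var>, <div> and <br>; turn &nbsp; entities into spaces.
    $text = str_replace('&nbsp;', ' ', strip_tags($postText, '<var><div><br>'));

    // 2. Trim, then collapse every run of whitespace into a single space.
    $text = preg_replace('~\s+~', ' ', trim($text));

    // 3. Empty <div></div> pairs and <br> tags become Windows line breaks.
    $text = preg_replace('~<div></div>|<br\s*/?>~', "\r\n", $text);

    // 4. <var ... id="ID">CONTENT</var> becomes <_link>ID|CONTENT</_link>.
    $text = preg_replace(
        '~<(var)[^>]*id="((?:tag|user)://\d+)"[^>]*>((?:[^<]+|<(?!/\1>))*)</\1>~',
        '<_link>\2|\3</_link>',
        $text
    );

    // 5. tag://NUMBER|#HASH becomes tag://HASH|#HASH.
    return preg_replace('~(?<=tag://)\d+(\|#(\w+))~', '\2\1', $text);
}

// Shortened sample taken from the posted content in the question.
$sample = '<span contenteditable="false" class="atwho-view-flag atwho-view-flag-#">'
    . '<var data-type="tag" class="tagHighlight" id="tag://4">#Hashtag</var>'
    . '<span contenteditable="false">&nbsp;<span></span></span></span>is '
    . '<span contenteditable="false" class="atwho-view-flag atwho-view-flag-@">'
    . '<var data-type="user" class="userHighlight" id="user://82">A Username </var>'
    . '<span contenteditable="false">&nbsp;<span></span></span></span>made it so...';

echo convertPostText($sample), PHP_EOL;
// <_link>tag://Hashtag|#Hashtag</_link> is <_link>user://82|A Username </_link> made it so...
```

Note that user links keep the user://ID form, while tag links are rewritten to use the hashtag name instead of the numeric id.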
For a better understanding of the last two regular expressions:
<(var)[^>]*id="((?:tag|user)://\d+)"[^>]*>((?:[^<]+|<(?!/\1>))*)</\1>

(?<=tag://)\d+(\|#(\w+))
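As a rough breakdown, the same two patterns can be written with the x (extended) modifier so that each part carries a comment. The variable names are made up for this sketch, the literal # has to be escaped as \# under the x modifier, and the replacement strings assume group 2 is the id and group 3 is the tag content.

```php
<?php
// Pattern that turns a <var> element into its id and content.
$varPattern = '~
    <(var)                      # group 1: the tag name
    [^>]*                       # any attributes before id
    id="((?:tag|user)://\d+)"   # group 2: the id value, such as tag://4
    [^>]*>                      # remaining attributes and the closing bracket
    ((?:[^<]+|<(?!/\1>))*)      # group 3: content up to the matching end tag
    </\1>                       # the end tag, reusing the name from group 1
~x';

// Pattern that swaps the numeric tag id for the hashtag name.
$tagFixPattern = '~
    (?<=tag://)     # only directly after the literal tag://
    \d+             # the numeric id, which is dropped
    (               # group 1: separator plus label, such as |#Hashtag
        \|\#        #   literal pipe and hash sign
        (\w+)       #   group 2: the hashtag name on its own
    )
~x';

$hashtag = '<var data-type="tag" class="tagHighlight" id="tag://2">#AnotherHashtag</var>';

$link = preg_replace($varPattern, '<_link>\2|\3</_link>', $hashtag);
// $link is '<_link>tag://2|#AnotherHashtag</_link>'

$fixed = preg_replace($tagFixPattern, '\2\1', $link);
// $fixed is '<_link>tag://AnotherHashtag|#AnotherHashtag</_link>'

echo $link, PHP_EOL, $fixed, PHP_EOL;
```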

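If I understand your question correctly, this should work for you:
$final = preg_replace("~(<+[A-Za-z0-9\/]+>)~", "\r\n", $divs);
This expression replaces each remaining attribute-less HTML tag (for example <div>, </div> or <br>) with a line break; tags that carry attributes are not matched.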
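A quick check of what this pattern actually matches may be useful before relying on it. The input strings below are invented for illustration; the pattern itself is the one from the line above.

```php
<?php
$pattern = "~(<+[A-Za-z0-9\/]+>)~";

// Tags without attributes are each replaced by a line break.
$plain = preg_replace($pattern, "\r\n", 'one<div>two</div><br>three');

// An opening tag with attributes contains spaces and quotes, so it is left
// in place; only the bare closing tag is replaced.
$withAttrs = preg_replace($pattern, "\r\n", '<span class="x">hi</span>');

var_dump($plain === "one\r\ntwo\r\n\r\nthree");        // bool(true)
var_dump($withAttrs === "<span class=\"x\">hi\r\n");   // bool(true)
```

So the pattern is a reasonable final pass for <div> and <br>, but it would need something like strip_tags beforehand to deal with the attribute-carrying <span> elements from the question.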