I have a PHP file which produces an XML sitemap based on data that has been imported from a number of sources. My sitemap is currently not well formed due to an illegal character in one line of the imported data, and I am struggling to remove it.
The character appears to be the 'squared' sign (superscript 2), and it is displayed as a square. I have tried pasting it into a hex editor, but it is shown as a ?, and the hex code also corresponds to ?. I have also tried using iconv to convert from all source encodings to all destination encodings, with no combination removing this character.
I also have the following function to remove non-ASCII characters:
function stripInvalidXml($value)
{
    $ret = "";
    $current;
    if (empty($value))
    {
        return $ret;
    }
    $length = strlen($value);
    for ($i=0; $i < $length; $i++)
    {
        // ord() returns the value of a single byte (0-255)
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            if($current != 0x1F)
            {
                $ret .= chr($current);
            }
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}
However, this still does not remove it. If I step through the code, the illegal character is expanded out in Eclipse's debug window. The string it has issues with is below (hoping it pastes correctly):
251gm-50
Any ideas for a function that will remove this character and prevent this from occurring are much appreciated. I have little control over the data that is imported, so it needs to be done at the point of XML generation.
EDIT
After posting, I can see that the character doesn't appear correctly. In Eclipse's debug window it appears as & # 65535 ; (written here with spaces, because without them it is rendered as the character itself).
You are trying to perform character transcoding. Don't do it by hand; use the functions PHP already provides.
I have found iconv quite useful:
$cleanText = iconv('UTF-8','ISO-8859-1//TRANSLIT//IGNORE', $srcText);
This code converts from UTF-8 to ISO-8859-1, trying to remap the 'exotic' characters and ignoring the ones that cannot be transcoded.
I'm just guessing that the source encoding is UTF-8. You have to find out which encoding the incoming data actually uses and convert it to the one you declare in the XML header.
A Linux command-line tool that guesses a file's encoding is enca.
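If the sitemap is declared as UTF-8 and you would rather keep the data in UTF-8 than convert it to ISO-8859-1, a different approach is to strip the disallowed code points directly. The function in the question cannot do this because ord() looks at one byte at a time, so every value is in the range 0-255 and each byte of a multi-byte sequence passes the checks. The sketch below instead matches whole code points with the /u modifier. It is only a sketch under assumptions: the function name stripInvalidXmlChars is made up for illustration, the input is assumed to already be valid UTF-8, and the sample string assumes the stray character is U+FFFF (as the & # 65535 ; in the debugger suggests) at a guessed position.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Assumes $value is valid UTF-8.
function stripInvalidXmlChars(string $value): string
{
    // Negated class of the code points allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $clean = preg_replace($pattern, '', $value);

    if ($clean === null) {
        // With the /u modifier, preg_replace() returns null for malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

// Assumption: the offending character is U+FFFF; its position here is a guess.
$dirty = "251gm\u{FFFF}-50";
echo stripInvalidXmlChars($dirty), "\n"; // prints "251gm-50"
```

Legitimate characters such as a real superscript 2 (U+00B2) are left alone, so this only drops what XML 1.0 forbids. If the incoming data may not be UTF-8 at all, it still has to be converted first, for example with iconv as shown above.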