 

How to read in UTF8+BOM file using PHP and not have the BOM appear as content?

Tags: php, unicode, utf-8

Pretty much what the question says. I've found lots of recommendations for how to strip the byte order mark once the text is read in, but that seems wrong. Isn't there a standard way in the language to read in a Unicode file with proper recognition and treatment of the BOM?

Michael Teper asked Sep 19 '25 09:09



2 Answers

Nope. You have to do it manually.

The BOM signals byte order in UTF-16, telling a decoder whether the data is little-endian (UTF-16LE) or big-endian (UTF-16BE), so it makes some sense for UTF-16 decoders to remove it automatically (and many do).

However, UTF-8 always has the same byte order and aims for ASCII compatibility, so a BOM was never envisaged as part of the encoding scheme as specified, and UTF-8 decoders aren't really supposed to give it any special treatment.

The UTF-8 faux-BOM is not part of the encoding, but an ad hoc (and somewhat controversial) marker some (predominantly Microsoft) applications use to signal that the file is probably UTF-8. It's not a standard in itself, so specifications that build on UTF-8, like XML and JSON, have had to make special dispensation for it.
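
For what it's worth, the manual step can be tiny. A minimal sketch, assuming the whole file is already in a string; `strip_utf8_bom` is a made-up helper name, not something built into PHP:

```php
<?php
// Hypothetical helper (not a PHP built-in): drop a leading UTF-8 BOM.
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF"; // the three bytes EF BB BF

    // Only strip when the BOM sits at the very start of the string.
    if (strncmp($text, $bom, strlen($bom)) === 0) {
        return substr($text, strlen($bom));
    }

    return $text;
}
```

Typical use would be something like `$text = strip_utf8_bom(file_get_contents($path));`, where `$path` is whatever file you are reading.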

bobince answered Sep 20 '25 21:09



I had the same problem. My function _fread() removes the BOM, which solved the issue for me:

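/**
 * Read a local file and strip a leading UTF-8 BOM, if present.
 *
 * @param  string|null  $file  Local filename
 * @return string|false File contents without the BOM, or false on failure
 */
function _fread($file = null) {
    if ($file === null || !is_readable($file)) {
        return false;
    }

    // file_get_contents() also copes with empty files, where
    // fread($fh, filesize($file)) would be asked to read 0 bytes.
    $data = file_get_contents($file);
    if ($data === false) {
        return false;
    }

    // Remove the UTF-8 BOM (bytes EF BB BF) only if it starts the data.
    $bom = "\xEF\xBB\xBF";
    if (strncmp($data, $bom, 3) === 0) {
        $data = substr($data, 3);
    }

    return $data;
}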
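
If the file is too big to read in one go, the same idea works on a stream. A rough sketch, assuming a local, seekable file; `fopen_skip_bom` is a hypothetical name I picked for illustration:

```php
<?php
// Hypothetical helper: open a file for reading and leave the handle
// positioned just past a UTF-8 BOM, if the file starts with one.
function fopen_skip_bom(string $path)
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        return false;
    }

    // Peek at the first three bytes; rewind if they are not the BOM.
    if (fread($fh, 3) !== "\xEF\xBB\xBF") {
        rewind($fh);
    }

    return $fh;
}
```

After that, fgets() or fgetcsv() on the returned handle starts at the first real character instead of the BOM.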
iProDev answered Sep 20 '25 21:09
