PHP Byte Order Mark bug

While debugging my WIB CMS, I ran into an “interesting” bug in PHP.

Let’s take this PHP script, bom.php:

<?php
$lines = file('test.txt');
foreach($lines as $line) {
    echo $line;
}

And this simple text file, test.txt:

first line
second line

Run the script from the command line and guess what happens? The text gets read and echoed out line by line, so the result is, essentially, the original text file printed to terminal:

$ php bom.php
first line
second line
$

Except that it isn’t, if you used Microsoft Windows Notepad to create that text file in UTF-8 format. If you did, then the output will look — depending on your terminal’s settings — something like this:

ï»¿first line
second line

WTF? The problem is that Notepad puts a byte order mark (BOM) at the beginning of the file. A UTF-8 BOM is allowed but not recommended by the Unicode standard. PHP doesn’t recognize the BOM and passes its three bytes (0xEF 0xBB 0xBF) through as ordinary data, which a non-UTF-8 terminal displays as three stray characters. This behavior is known as PHP Bug #22108. It was first reported over six years ago and has not been fixed so far. Apparently it will be fixed in PHP 6. Nice…
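To see the stray bytes for yourself, here is a minimal sketch. It assumes a temporary file written with a leading BOM stands in for a file saved by Notepad; the temp-file handling and the variable names are mine, not part of the original script.

```php
<?php
// Assumption: this temp file stands in for a Notepad-saved UTF-8 file.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBFfirst line\nsecond line\n");

// Read it the same way bom.php does.
$lines = file($path);
unlink($path);

// The first three bytes of the first line are the BOM, not text.
$prefix = bin2hex(substr($lines[0], 0, 3));
echo $prefix, "\n";            // efbbbf

// 3 BOM bytes + "first line" (10 bytes) + newline (1 byte)
echo strlen($lines[0]), "\n";  // 14
```

Only the first line is affected; the second line comes back exactly as written.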

Luckily, the workaround is relatively simple. Just run anything that might have originated from Notepad through this:

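function remove_bom($string) {
    // The UTF-8 BOM is the three-byte sequence EF BB BF.
    $bom = "\xEF\xBB\xBF";
    if (substr($string, 0, 3) === $bom) {
        // Cast because substr() can return false on older PHP versions.
        return (string) substr($string, 3);
    }
    return $string;
}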

It checks whether the first three bytes are the BOM, and if so, removes them. To be pedantic, this function is broken: it doesn’t work properly if you actually wanted to write the Latin-1 characters ï»¿ at the beginning of your file. Fix that if you want, I won’t :-)
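Since file() returns an array of lines, only the first element can carry the BOM. Here is a rough sketch of how the workaround might be wired into the script from the beginning of this post. The file_without_bom() helper and the temp file are hypothetical, and remove_bom() is restated so the snippet runs on its own.

```php
<?php
// Same idea as remove_bom() above, restated for a self-contained example.
function remove_bom($string) {
    if (substr($string, 0, 3) === "\xEF\xBB\xBF") {
        return (string) substr($string, 3);
    }
    return $string;
}

// Hypothetical helper: read a file into lines and strip a leading BOM
// from the first line only, since the BOM can only appear there.
function file_without_bom($path) {
    $lines = file($path);
    if ($lines !== false && count($lines) > 0) {
        $lines[0] = remove_bom($lines[0]);
    }
    return $lines;
}

// Assumption: this temp file stands in for a Notepad-saved test.txt.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBFfirst line\nsecond line\n");

$lines = file_without_bom($path);
unlink($path);

foreach ($lines as $line) {
    echo $line;
}
```

Running remove_bom() on every line would also work, but it is wasted effort for all lines but the first.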

Last modified: 2009-11-22 18:13 +0200

