Friday, November 28, 2014

Migrating from PHP 5.3 to PHP 5.4 or 5.5 - Watch Out For a Dangerous iconv Bug

If you use iconv to filter invalid characters out of strings and you migrate to PHP 5.4 or 5.5, you may run into the nasty bug that bit me.


Currently I convert all my web data from UTF-8 to ISO-8859-1 (otherwise known as ISO Latin-1) so that it can be inserted into PDF reports built with the fantastic FPDF library.

The code looks something like this:
$clean = @iconv("UTF-8", "ISO-8859-1//IGNORE//TRANSLIT", $text);

I used the error suppression here so no errors get output to the screen when an invalid character needs to be stripped.  If the error displays on the screen, then it interrupts the creation of the PDF file and the user does not get a file.  Not so great, right?

Here is the error that is returned when the error is not suppressed:
Notice: iconv(): Detected an illegal character in input string in {...}

I had been using the //IGNORE directive to tell the function to discard characters that cannot be converted. I also use //TRANSLIT so that a character with no exact match in the target character set is replaced by its closest approximation.



$text = "Equipment List – Projéct [2014-Nov-28]";
$clean = @iconv("UTF-8", "ISO-8859-1//IGNORE//TRANSLIT", $text);
print($clean);



It may be hard to see, but the first dash '–' is an en dash (U+2013), which is a different, longer character than the '-' hyphen-minus. It has no representation in the ISO-8859-1 character set, so iconv must either transliterate it or drop it. Also, I placed an 'é' in 'Projéct' just for good measure; that one does exist in ISO-8859-1.
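To check which characters in a string fall outside ISO-8859-1, a rough sketch like the following can help. The function name is_latin1_representable is hypothetical, and the check assumes the input is already valid UTF-8: it only tests that every character encodes to a code point in the range U+0000 to U+00FF, which is the range ISO-8859-1 covers.

```php
<?php
// Hypothetical helper (not part of PHP or iconv): returns true when every
// character of a valid UTF-8 string has a code point of U+00FF or lower.
// Assumption: $utf8 is well-formed UTF-8; malformed input is not diagnosed.
function is_latin1_representable($utf8)
{
    // ASCII is one byte (00-7F); U+0080..U+00FF is C2 80 .. C3 BF in UTF-8.
    return preg_match('/^(?:[\x00-\x7F]|[\xC2\xC3][\x80-\xBF])*$/D', $utf8) === 1;
}

$enDash = "\xE2\x80\x93"; // U+2013 written as raw UTF-8 bytes
$eAcute = "\xC3\xA9";     // U+00E9 written as raw UTF-8 bytes

var_export(is_latin1_representable($enDash)); // false
echo "\n";
var_export(is_latin1_representable($eAcute)); // true
echo "\n";
```

The byte escapes are used instead of literal characters so the result does not depend on how the source file itself is saved.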

In my PHP version 5.3.29 (with iconv library version 2.17) I get the output:
Equipment List - Projéct [2014-Nov-28]

However in PHP version 5.5.19 (also with iconv library version 2.17) with the same code, I get no output at all.


Temporary Solution


So how to correct for that behavior?  Well, I'm not quite sure what's going on in the code for iconv, but I found that if I remove the //IGNORE directive and just leave //TRANSLIT then I am ok.

$text = "Equipment List – Projéct [2014-Nov-28]";
$clean = @iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text);
print($clean);
Output:
Equipment List - Projéct [2014-Nov-28]

That's what I wanted, so we should be good for now.  At least until I can examine the iconv source code from GNU Libc and see what is going on.
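Because the @ operator hides the notice, a failed conversion is easy to miss. One way to guard against that, as a sketch only, is to test the return value explicitly. The wrapper name to_latin1_or_default and its $default parameter are my own invention for illustration, and what iconv does with //TRANSLIT depends on the iconv implementation PHP was built against.

```php
<?php
// Hypothetical wrapper (name and $default parameter are assumptions):
// converts UTF-8 to ISO-8859-1 and makes a failed conversion explicit
// instead of letting FALSE silently print as an empty string.
function to_latin1_or_default($text, $default = '')
{
    // Errors are suppressed so nothing is written into the PDF output.
    $converted = @iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $text);

    // Strict comparison: an empty string is a valid result, FALSE is not.
    if ($converted === false) {
        return $default;
    }
    return $converted;
}

// Plain ASCII input, so the result does not depend on transliteration.
$clean = to_latin1_or_default('Equipment List [2014-Nov-28]', '(conversion failed)');
print($clean . "\n");
```

Returning a visible placeholder rather than an empty string is just one design choice; logging the failure or throwing an exception would work equally well.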


Update:

After reviewing the source code of the PHP implementation of iconv (from PHP 5.3 iconv to PHP 5.4 iconv), I think the culprit lies in how PHP calls the iconv library, not in the iconv library itself. In the later version, the code that sets the return value first checks the error status and only returns the converted string on success (line 2390 of PHP5.4+.iconv.c):
    if (err == PHP_ICONV_ERR_SUCCESS && out_buffer != NULL) {
        RETVAL_STRINGL(out_buffer, out_len, 0);
    } else {
        if (out_buffer != NULL) {
            efree(out_buffer);
        }
        RETURN_FALSE;
    }
In the original 5.3 version it returned whatever was in the output buffer, even if an error had been reported (line 2330 of PHP5.3.iconv.c):


    if (out_buffer != NULL) {
        RETVAL_STRINGL(out_buffer, out_len, 0);
    } else {
        RETURN_FALSE;
    }

It looks like that extra check (err == PHP_ICONV_ERR_SUCCESS in the 5.4+ version) causes any conversion error to return FALSE, which prints as an empty string ''.
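The "no output at all" symptom follows from how PHP converts FALSE to a string. The short sketch below uses a plain false value as a stand-in for a failed iconv() call, and describe_iconv_result is a hypothetical helper added only for illustration.

```php
<?php
// Hypothetical helper: reports whether a value looks like a failed
// iconv() call (FALSE) or a converted string.
function describe_iconv_result($result)
{
    if ($result === false) {
        return 'conversion failed (FALSE returned)';
    }
    return 'converted ' . strlen($result) . ' bytes';
}

$failed = false; // stands in for the return value of a failed iconv() call

// FALSE casts to the empty string, so printing it shows nothing.
print('[' . $failed . "]\n");                 // prints "[]"
print(describe_iconv_result($failed) . "\n"); // conversion failed (FALSE returned)
print(describe_iconv_result('abc') . "\n");   // converted 3 bytes
```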