is_utf8() – check for UTF-8

With this PHP function you can check whether a string is encoded as UTF-8, or at least appears to be.

It scans the string byte by byte and returns false as soon as it finds a byte sequence that is not valid UTF-8.

<?php
function is_utf8($str) {
    $strlen = strlen($str);
    for ($i = 0; $i < $strlen; $i++) {
        $ord = ord($str[$i]);
        if ($ord < 0x80) continue; // 0bbbbbbb
        elseif (($ord & 0xE0) === 0xC0 && $ord > 0xC1) $n = 1; // 110bbbbb (excl. C0-C1)
        elseif (($ord & 0xF0) === 0xE0) $n = 2; // 1110bbbb
        elseif (($ord & 0xF8) === 0xF0 && $ord < 0xF5) $n = 3; // 11110bbb (excl. F5-FF)
        else return false; // invalid UTF-8 character
        for ($c=0; $c<$n; $c++) // $n following bytes? // 10bbbbbb
            if (++$i === $strlen || (ord($str[$i]) & 0xC0) !== 0x80)
                return false; // invalid UTF-8 char
    }
    return true; // didn't find any invalid characters
}

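# example usage: $str holds the input string
# (utf8_encode() assumes ISO-8859-1 input and is deprecated as of PHP 8.2)
echo is_utf8($str) ? $str : utf8_encode($str);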
?>
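
The scan above only looks at the bit patterns of lead and continuation bytes, so it still accepts a few sequences that the UTF-8 standard forbids, for example the overlong form E0 80 80 or the surrogate ED A0 80. The following is a minimal sketch of a stricter variant that also limits the range of the first continuation byte; the name is_utf8_strict() is made up for this example and is not part of the original function.

<?php
// Hypothetical stricter variant of is_utf8(); the name is an assumption.
function is_utf8_strict($str) {
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $ord = ord($str[$i]);
        if ($ord < 0x80) continue; // ASCII
        // $n = number of following bytes, $lo-$hi = allowed range of the first one
        if ($ord >= 0xC2 && $ord <= 0xDF)     { $n = 1; $lo = 0x80; $hi = 0xBF; }
        elseif ($ord === 0xE0)                { $n = 2; $lo = 0xA0; $hi = 0xBF; } // no overlong forms
        elseif ($ord === 0xED)                { $n = 2; $lo = 0x80; $hi = 0x9F; } // no surrogates
        elseif ($ord >= 0xE1 && $ord <= 0xEF) { $n = 2; $lo = 0x80; $hi = 0xBF; }
        elseif ($ord === 0xF0)                { $n = 3; $lo = 0x90; $hi = 0xBF; } // no overlong forms
        elseif ($ord >= 0xF1 && $ord <= 0xF3) { $n = 3; $lo = 0x80; $hi = 0xBF; }
        elseif ($ord === 0xF4)                { $n = 3; $lo = 0x80; $hi = 0x8F; } // nothing above U+10FFFF
        else return false; // invalid lead byte
        for ($c = 0; $c < $n; $c++) {
            if (++$i === $len) return false; // string ends inside a sequence
            $b = ord($str[$i]);
            if ($b < $lo || $b > $hi) return false;
            $lo = 0x80; $hi = 0xBF; // remaining bytes: any continuation byte
        }
    }
    return true;
}

var_dump(is_utf8_strict("\xE2\x82\xAC")); // bool(true): the euro sign
var_dump(is_utf8_strict("\xE0\x80\x80")); // bool(false): overlong sequence
?>

Whether the extra strictness matters depends on the input; for telling ISO-8859-1 text apart from UTF-8 text, the simpler check is usually enough.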

Important: This function may return a wrong result if a string encoded in ISO-8859-1 or some other encoding happens to be valid UTF-8 as well. That mostly happens with random byte strings, though, not with human-readable text. If the string consists only of 7-bit (ASCII) characters, it is detected as UTF-8 even if it was meant as ISO-8859-1, since the two encodings produce identical bytes in that case.
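
To make the ambiguity concrete, here is a small sketch. It assumes the mbstring extension is loaded, and the sample strings are chosen only for illustration. The same two bytes are one character when read as UTF-8 and two characters when read as ISO-8859-1, while a pure ASCII string is byte-for-byte identical in both encodings.

<?php
// Assumes the mbstring extension is loaded.
$bytes = "\xC3\xA9"; // "é" in UTF-8, but also "Ã©" in ISO-8859-1

echo mb_strlen($bytes, 'UTF-8'), "\n";      // 1 character when read as UTF-8
echo mb_strlen($bytes, 'ISO-8859-1'), "\n"; // 2 characters when read as ISO-8859-1

// Pure ASCII: converting from ISO-8859-1 to UTF-8 changes nothing.
$ascii = "plain text";
var_dump(mb_convert_encoding($ascii, 'UTF-8', 'ISO-8859-1') === $ascii); // bool(true)
?>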

Note: A string can also be checked for UTF-8 validity with a regular expression, but that approach seems to be slower on strings longer than about 100 bytes.
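
One common form of the regular expression approach, not necessarily the one meant above, relies on the /u modifier: PCRE validates the subject string as UTF-8 before matching, and preg_match() returns false if that validation fails. A minimal sketch, with is_utf8_regex() as a made-up helper name:

<?php
// Hypothetical helper name; not part of the original function.
function is_utf8_regex($str) {
    // The empty pattern always matches a valid subject, so the result is 1
    // for valid UTF-8 and false (not 0) when the UTF-8 check fails.
    return preg_match('//u', $str) === 1;
}

var_dump(is_utf8_regex("Gr\xC3\xBC\xC3\x9Fe")); // bool(true): "Grüße" in UTF-8
var_dump(is_utf8_regex("Gr\xFC\xDFe"));         // bool(false): "Grüße" in ISO-8859-1
?>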

If you have access to the Multibyte String extension, you might also use the following method:

<?php
if (mb_detect_encoding($str, 'UTF-8, ISO-8859-1', true) === 'UTF-8'){ // third argument: strict mode
    # the string is encoded in UTF-8
}
?>
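
mb_detect_encoding() picks the best candidate from a list. If the only question is whether a string is valid UTF-8, mb_check_encoding() from the same extension answers it directly. A short sketch, assuming mbstring is loaded and using sample strings chosen for illustration:

<?php
// Assumes the mbstring extension is loaded.
$utf8   = "Gr\xC3\xBC\xC3\x9Fe"; // "Grüße" in UTF-8
$latin1 = "Gr\xFC\xDFe";         // "Grüße" in ISO-8859-1

var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
?>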
