127

Is there a function in PHP that can decode Unicode escape sequences like "\u00ed" to "í" and all other similar occurrences?

I found similar question here but is doesn't seem to work.

0

8 Answers 8

218

Try this:

$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}, $str);

In case it's UTF-16 based C/C++/Java/Json-style:

$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UTF-16BE');
}, $str);
Sign up to request clarification or add additional context in comments.

13 Comments

@Docstero: The regular expression will match any sequence of \u followed by four hexadecimal digits.
Warning: preg_replace_callback() [function.preg-replace-callback]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
This function cannot deal with supplementary characters as they cannot be represented in UCS-2.
@gumbo How do you call or use this function?
I found my way here as I had \u00ed in my output, but I was looking at the output with json_encode() and funnily enough the default json_encode() will trash up the output so use json_encode($theDict,JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
|
86
print_r(json_decode('{"t":"\u00ed"}')); // -> stdClass Object ( [t] => í )

5 Comments

It doesn't even need the object wrapper: json_decode('"' . $text . '"')
Thanks. This seems to be STANDARD WAY, rather then accepted answer.
Interestingly, this also works for complex entities like smiley faces... json_decode('{"t":"\uD83D\uDE0A"}') is �?�
@deceze you should include the fact that $text can include double quotes. So a revised version would be: json_decode('"'.str_replace('"', '\\"', $text).'"'). Thanks for your help :-)
This approach didn't work for me for decoding an entire article with various problematic characters. I get this json error: JSON_ERROR_CTRL_CHAR. The accepted answer works for me.
27

PHP 7+

As of PHP 7, you can use the Unicode codepoint escape syntax to do this.

echo "\u{00ed}"; outputs í.

Comments

18
$str = '\u0063\u0061\u0074'.'\ud83d\ude38';
$str2 = '\u0063\u0061\u0074'.'\ud83d';

// U+1F638
var_dump(
    "cat\xF0\x9F\x98\xB8" === escape_sequence_decode($str),
    "cat\xEF\xBF\xBD" === escape_sequence_decode($str2)
);

function escape_sequence_decode($str) {

    // [U+D800 - U+DBFF][U+DC00 - U+DFFF]|[U+0000 - U+FFFF]
    $regex = '/\\\u([dD][89abAB][\da-fA-F]{2})\\\u([dD][c-fC-F][\da-fA-F]{2})
              |\\\u([\da-fA-F]{4})/sx';

    return preg_replace_callback($regex, function($matches) {

        if (isset($matches[3])) {
            $cp = hexdec($matches[3]);
        } else {
            $lead = hexdec($matches[1]);
            $trail = hexdec($matches[2]);

            // http://unicode.org/faq/utf_bom.html#utf16-4
            $cp = ($lead << 10) + $trail + 0x10000 - (0xD800 << 10) - 0xDC00;
        }

        // https://tools.ietf.org/html/rfc3629#section-3
        // Characters between U+D800 and U+DFFF are not allowed in UTF-8
        if ($cp > 0xD7FF && 0xE000 > $cp) {
            $cp = 0xFFFD;
        }

        // https://github.com/php/php-src/blob/php-5.6.4/ext/standard/html.c#L471
        // php_utf32_utf8(unsigned char *buf, unsigned k)

        if ($cp < 0x80) {
            return chr($cp);
        } else if ($cp < 0xA0) {
            return chr(0xC0 | $cp >> 6).chr(0x80 | $cp & 0x3F);
        }

        return html_entity_decode('&#'.$cp.';');
    }, $str);
}

2 Comments

Thank you. This seems to work with supplementary character, such as �?�
This seems to be the only answer that works for my usecase: decoding unicode escape sequences in a string. Thanks!
2

This is a sledgehammer approach to replacing raw UNICODE with HTML. I haven't seen any other place to put this solution, but I assume others have had this problem.

Apply this str_replace function to the RAW JSON, before doing anything else.

function unicode2html($str){
    $i=65535;
    while($i>0){
        $hex=dechex($i);
        $str=str_replace("\u$hex","&#$i;",$str);
        $i--;
     }
     return $str;
}

This won't take as long as you think, and this will replace ANY unicode with HTML.

Of course this can be reduced if you know the unicode types that are being returned in the JSON.

For example my code was getting lots of arrows and dingbat unicode. These are between 8448 an 11263. So my production code looks like:

$i=11263;
while($i>08448){
    ...etc...

You can look up the blocks of Unicode by type here: http://unicode-table.com/en/ If you know you're translating Arabic or Telegu or whatever, you can just replace those codes, not all 65,000.

You could apply this same sledgehammer to simple encoding:

 $str=str_replace("\u$hex",chr($i),$str);

Comments

1

There is also a solution:
http://www.welefen.com/php-unicode-to-utf8.html

function entity2utf8onechar($unicode_c){
    $unicode_c_val = intval($unicode_c);
    $f=0x80; // 10000000
    $str = "";
    // U-00000000 - U-0000007F:   0xxxxxxx
    if($unicode_c_val <= 0x7F){         $str = chr($unicode_c_val);     }     //U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
    else if($unicode_c_val >= 0x80 && $unicode_c_val <= 0x7FF){         $h=0xC0; // 11000000
        $c1 = $unicode_c_val >> 6 | $h;
        $c2 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2);
    } else if($unicode_c_val >= 0x800 && $unicode_c_val <= 0xFFFF){         $h=0xE0; // 11100000
        $c1 = $unicode_c_val >> 12 | $h;
        $c2 = (($unicode_c_val & 0xFC0) >> 6) | $f;
        $c3 = ($unicode_c_val & 0x3F) | $f;
        $str=chr($c1).chr($c2).chr($c3);
    }
    //U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x10000 && $unicode_c_val <= 0x1FFFFF){         $h=0xF0; // 11110000
        $c1 = $unicode_c_val >> 18 | $h;
        $c2 = (($unicode_c_val & 0x3F000) >>12) | $f;
        $c3 = (($unicode_c_val & 0xFC0) >>6) | $f;
        $c4 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4);
    }
    //U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x200000 && $unicode_c_val <= 0x3FFFFFF){         $h=0xF8; // 11111000
        $c1 = $unicode_c_val >> 24 | $h;
        $c2 = (($unicode_c_val & 0xFC0000)>>18) | $f;
        $c3 = (($unicode_c_val & 0x3F000) >>12) | $f;
        $c4 = (($unicode_c_val & 0xFC0) >>6) | $f;
        $c5 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5);
    }
    //U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if($unicode_c_val >= 0x4000000 && $unicode_c_val <= 0x7FFFFFFF){         $h=0xFC; // 11111100
        $c1 = $unicode_c_val >> 30 | $h;
        $c2 = (($unicode_c_val & 0x3F000000)>>24) | $f;
        $c3 = (($unicode_c_val & 0xFC0000)>>18) | $f;
        $c4 = (($unicode_c_val & 0x3F000) >>12) | $f;
        $c5 = (($unicode_c_val & 0xFC0) >>6) | $f;
        $c6 = ($unicode_c_val & 0x3F) | $f;
        $str = chr($c1).chr($c2).chr($c3).chr($c4).chr($c5).chr($c6);
    }
    return $str;
}
function entities2utf8($unicode_c){
    $unicode_c = preg_replace("/\&\#([\da-f]{5})\;/es", "entity2utf8onechar('\\1')", $unicode_c);
    return $unicode_c;
}

Comments

1

fix json values, it's add \ before u{xxx} to all +" "

  $item = preg_replace_callback('/"(.+?)":"(u.+?)",/', function ($matches) {
        $matches[2] = preg_replace('/(u)/', '\u', $matches[2]);
            $matches[2] = preg_replace('/(")/', '&quot;', $matches[2]); 
            $matches[2] = json_decode('"' . $matches[2] . '"'); 
            return '"' . $matches[1] . '":"' . $matches[2] . '",';
        }, $item);

Comments

1

There is a very simple and beautiful solution.

If we want to decode Unicode escape sequences like "\u00ed" to "í" we may use simple function json_decode:

$a="\u00ed";
echo json_decode("\"$a\"");
# output: í

It works because json_encode encodes all non utf-8 symbols to \u**** sequence:

echo json_encode("í");
# output: "\u00ed"

This is a little continuation of the solution https://stackoverflow.com/a/7981441/5599052.

1 Comment

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.