
I have some json I need to decode, alter and then encode without messing up any characters.

If I have a Unicode character in a JSON string, it will not decode. I'm not sure why, since json.org says a string can contain any Unicode character except ", \ or a control character. It doesn't work in Python either.

{"Tag":"Odómetro"}

I can use utf8_encode, which allows the string to be decoded with json_decode, but the character gets mangled into something else. This is the print_r output of the resulting array; the ó has become two characters.

[Tag] => OdÃ³metro

When I encode the array again, I get the character escaped to ASCII, which is correct according to the JSON spec:

"Tag"=>"Od\u00f3metro"

Is there some way I can un-escape this? json_encode gives no such option, utf8_encode does not seem to work either.

Edit: I see there is a JSON_UNESCAPED_UNICODE option for json_encode. However, it's not working as expected. Oh damn, it's only in PHP 5.4. I will have to use some regex, as I only have 5.3.

$json = json_encode($array, JSON_UNESCAPED_UNICODE);
Warning: json_encode() expects parameter 2 to be long, string ...
What character set is your input in? UTF-8? Something else? – John Flatness Sep 11 '11 at 23:13
Yes, UTF8...... – Keyo Sep 11 '11 at 23:15
JSON_UNESCAPED_UNICODE is new in PHP 5.4 (i.e., it doesn't exist yet). – John Flatness Sep 11 '11 at 23:16
If you're already working with UTF-8, then you definitely don't want to use utf8_encode, which is designed to convert from ISO 8859-1 to UTF-8. Is this string coming from a database, a string literal, or some other source? (Reason for all the questions: json_encode is specifically built to only work with UTF-8 strings). – John Flatness Sep 11 '11 at 23:19
It's coming from a postgres database, which has encoding set to UTF-8. I'm not sure why it does not parse, python won't parse it either though. – Keyo Sep 11 '11 at 23:24

5 Answers

up vote 6 down vote accepted

Judging from everything you've said, it seems like the original Odómetro string you're dealing with is encoded with ISO 8859-1, not UTF-8.

Here's why I think so:

  • json_encode produced parseable output after you ran the input string through utf8_encode, which converts from ISO 8859-1 to UTF-8.
  • You did say that you got "mangled" output when using print_r after doing utf8_encode, but the mangled output you got is exactly what happens when UTF-8 text is read as ISO 8859-1 (ó is \xc3\xb3 in UTF-8, but that byte sequence is Ã³ in ISO 8859-1).
  • Your htmlentities hackaround solution worked. htmlentities needs to know the encoding of the input string to work correctly. If you don't specify one, it assumes ISO 8859-1. (html_entity_decode, confusingly, defaults to UTF-8, so your method had the effect of converting from ISO 8859-1 to UTF-8.)
  • You said you had the same problem in Python, which would seem to exclude PHP from being the issue.
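
The byte-level reasoning in the points above can be checked with a short script. This is only a sketch: it assumes the mbstring extension is available, and it uses mb_convert_encoding with 'ISO-8859-1' as the source encoding as a stand-in for utf8_encode. The variable names are made up for illustration.

```php
<?php
// Bytes are written as escapes so the result does not depend on how this file is saved.
$latin1 = "Od\xF3metro";      // "Odómetro" in ISO 8859-1: ó is the single byte F3
$utf8   = "Od\xC3\xB3metro";  // "Odómetro" in UTF-8: ó is the two bytes C3 B3

// A lone F3 byte is not valid UTF-8, so json_decode rejects the document.
$bad      = json_decode('{"Tag":"' . $latin1 . '"}', true);
$badError = json_last_error();  // JSON_ERROR_UTF8 (constant available since PHP 5.3.3)

// Converting ISO 8859-1 to UTF-8 (what utf8_encode does) makes it decodable.
$fixed = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$good  = json_decode('{"Tag":"' . $fixed . '"}', true);

// Running the same conversion on text that is already UTF-8 double-encodes it:
// the bytes C3 and B3 are each treated as a character (Ã and ³) and re-encoded.
$mangled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
```

If the input really is ISO 8859-1, the first conversion is the fix. If it is already UTF-8, the same call produces the two-character "Ã³" seen in the question.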

PHP will use the \uXXXX escaping, but as you noted, this is valid JSON.
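
To see that the escaped form carries the same data, here is a small round trip. It assumes the input is valid UTF-8; the variable names are illustrative only.

```php
<?php
$utf8 = "Od\xC3\xB3metro";  // UTF-8 bytes for "Odómetro"

$escaped = json_encode(array('Tag' => $utf8));  // ó comes out as \u00f3
$back    = json_decode($escaped, true);         // and decodes to the same bytes

var_dump($back['Tag'] === $utf8);
```

Any conforming JSON parser should treat \u00f3 and a literal ó inside a string as the same character, so the escaping only affects how the file looks, not what it means.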

So, it seems like you need to configure your connection to Postgres so that it will give you UTF-8 strings. The PHP manual indicates you'd do this by appending options='--client_encoding=UTF8' to the connection string. There's also the possibility that the data currently stored in the database is in the wrong encoding. (You could simply use utf8_encode, but this will only support characters that are part of ISO 8859-1).
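
A rough sketch of what that connection setup might look like follows. The function name and the host, database and user parameters are placeholders I made up; only pg_connect, pg_set_client_encoding and the options string come from the PHP manual. It needs the pgsql extension and a reachable server, so treat it as untested against your setup.

```php
<?php
// Hypothetical helper: open a connection that hands back UTF-8 strings.
// $host, $dbname and $user are placeholders for your own settings.
function connect_utf8($host, $dbname, $user)
{
    $conn = pg_connect(
        "host=$host dbname=$dbname user=$user options='--client_encoding=UTF8'"
    );
    if ($conn === false) {
        return false;
    }
    // Ask for UTF8 explicitly as well; returns 0 on success and -1 on error.
    if (pg_set_client_encoding($conn, 'UTF8') !== 0) {
        return false;
    }
    return $conn;
}
```

Calling pg_client_encoding($conn) afterwards is a quick way to confirm which encoding the connection is actually using.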

Finally, as another answer noted, you do need to make sure that you're declaring the proper charset, with an HTTP header or otherwise (of course, this particular issue might have just been an artifact of the environment where you did your print_r testing).

Thanks for the detail in your answer. It seems correct since Japanese characters are converted to \uXXXX even with htmlentities($item, NULL, 'UTF-8');. So I can only assume the input string is not utf-8. I've just been testing this with a simple form for now, but it seems that the pg library uses ISO string encoding. – Keyo Sep 12 '11 at 1:17

JSON_UNESCAPED_UNICODE was added in PHP 5.4, so it looks like you need to upgrade your version of PHP to take advantage of it. 5.4 is not released yet though! :(

There is a 5.4 alpha release candidate on QA though if you want to play on your development machine.
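
If the same code has to run on both 5.3 and 5.4, one option is to check for the constant first. This is just a sketch; encode_json_readable is a name I made up, and on 5.3 it simply falls back to the escaped output.

```php
<?php
// Hypothetical wrapper: use JSON_UNESCAPED_UNICODE only where the constant exists.
function encode_json_readable($value)
{
    if (defined('JSON_UNESCAPED_UNICODE')) {
        return json_encode($value, JSON_UNESCAPED_UNICODE);
    }
    // Older PHP: the output is still valid JSON, just with \uXXXX escapes.
    return json_encode($value);
}

$json = encode_json_readable(array('Tag' => "Od\xC3\xB3metro"));
echo $json, "\n";
```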

Unfortunately I'm stuck with 5.3, but I've found a way around this in my Answer below. – Keyo Sep 12 '11 at 0:24

A hacky way of doing JSON_UNESCAPED_UNICODE in PHP 5.3. Really disappointed by PHP json support. Maybe this will help someone else.

$array = some_json();
// Encode all string children in the array to HTML entities.
array_walk_recursive($array, function (&$item, $key) {
    if (is_string($item)) {
        $item = htmlentities($item);
    }
});
$json = json_encode($array);

// Decode the HTML entities and end up with Unicode again.
$json = html_entity_decode($json);
This will only reliably work if your strings in $array are (and I do hate to keep banging the same drum) encoded with ISO 8859-1. This is, in effect, a convoluted way of converting from ISO 8859-1 to UTF-8. This would have the effect of giving you JSON without the Unicode escape sequences, but if your input strings are UTF-8, you need to pass 'UTF-8' as the charset parameter to htmlentities for this to work. – John Flatness Sep 12 '11 at 0:54
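
For input that is already UTF-8, a regex over the encoded JSON is another way to get rid of the escapes on 5.3 without going through HTML entities. The sketch below is not from the thread: json_unescape_unicode is a made-up name, and it assumes the string it receives is the direct output of json_encode.

```php
<?php
// Hypothetical helper: turn \uXXXX escapes for non-ASCII characters back into UTF-8.
function json_unescape_unicode($json)
{
    // \x5c is a backslash. First alternative: an escaped backslash, returned
    // unchanged so that a literal "\u00f3" in the data is not touched.
    // Second alternative: a run of \uXXXX escapes at U+0080 or above. Escapes
    // below that (control characters, ASCII) are skipped so the JSON stays valid.
    $pattern = '/\x5c\x5c|((?:\x5cu(?!00[0-7])[0-9a-f]{4})+)/i';

    return preg_replace_callback($pattern, function ($m) {
        if (empty($m[1])) {
            return $m[0];
        }
        // Let json_decode do the conversion, including surrogate pairs.
        $decoded = json_decode('"' . $m[1] . '"');
        return is_string($decoded) ? $decoded : $m[0];
    }, $json);
}

echo json_unescape_unicode(json_encode(array('Tag' => "Od\xC3\xB3metro"))), "\n";
```

Working on the escapes directly means quotes, ampersands and angle brackets in the data are never turned into entities and back, which is where the htmlentities route can produce invalid JSON.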

try setting the utf-8 encoding in your page:

header('content-type:text/html;charset=utf-8');

this works for me:

$arr = array('tag' => 'Odómetro');
$encoded = json_encode($arr);
$decoded = json_decode($encoded);
echo $decoded->{'tag'};
Good tip, but that's only half the problem. I need to encode it properly. This won't even be displayed it's purely for data manipulation. – Keyo Sep 11 '11 at 23:20

Try Using:

utf8_decode() and utf8_encode