
I'm using Python (Python 2.5.2 on Ubuntu 8.10) to parse JSON from (ASCII encoded) text files. When loading these files with json (simplejson), all my string values are cast to Unicode objects instead of string objects.

The problem is, I have to use the data with some libraries that only accept string objects.

Is it possible to get string objects instead of unicode ones from simplejson?
Any hints on how I can achieve this automatically?

Edit: I can't change the libraries nor update them. One - the csv module - is even in the Python standard library (the documentation says it will support Unicode in the future). I could write wrappers of course, but maybe there is a more convenient way?

The actual data I parse from the JSON files is rather nested and complex, so it would be a pain to look for every Unicode object therein and cast it manually...

Here's a small example:

>>> import simplejson as json
>>> l = ['a', 'b']
>>> l
['a', 'b']
>>> js = json.dumps(l)
>>> js
'["a", "b"]'
>>> nl = json.loads(js)
>>> nl
[u'a', u'b']

Update: I completely agree with Jarret Hardie and nosklo: since the JSON spec specifically defines strings as Unicode, simplejson should return Unicode objects.

But while searching the net, I came across some posts where people complained about simplejson actually returning string objects... I couldn't reproduce this behavior, but it seems to be possible. Any hints?

Workaround

Right now I use PyYAML to parse the files; it gives me string objects.
Since JSON is a subset of YAML, this works nicely.

By the way, why don't you use unicode internally? :) – NicDumZ Jun 6 at 11:03
@NicDumZ: I would love to ;) But as I said, the libraries I use don't support Unicode. Working around Unicode sadly is more convenient (in this case) than replacing or wrapping the libraries. – Brutus Jun 6 at 11:20

2 Answers


I'm afraid there's no way to achieve this automatically within the simplejson library.

The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You'd have to either monkeypatch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.

The reason that simplejson outputs unicode, however, is that the json spec specifically mentions that "A string is a collection of zero or more Unicode characters"... support for unicode is assumed as part of the format itself. Simplejson's scanstring implementation goes so far as to scan and interpret unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.
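A quick illustration of why: an escaped character in the JSON input can only be represented faithfully as a unicode text object once the decoder has interpreted it (shown here with the stdlib `json` module, which shares simplejson's decoder):

```python
import json

# The decoder interprets the \u00e9 escape sequence, so the only
# faithful representation of the result is a unicode text object:
# 'caf' followed by U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
decoded = json.loads('"caf\\u00e9"')
```

There is no byte string the decoder could hand back here without first guessing an output encoding on your behalf.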

If you have an aged library that needs an str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid... sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.
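If you do go the post-processing route, the search can at least be automated with a small recursive helper. A sketch (the function name is my own), written so it runs on both Python 2 and 3 — on Python 2 it turns every decoded unicode value into an encoded str:

```python
import json

try:
    text_type = unicode  # Python 2: decoded JSON strings are unicode
except NameError:
    text_type = str      # Python 3: there is only str

def byteify(obj, encoding='utf-8'):
    """Recursively encode every text string in a parsed JSON structure."""
    if isinstance(obj, dict):
        return dict((byteify(k, encoding), byteify(v, encoding))
                    for k, v in obj.items())
    if isinstance(obj, list):
        return [byteify(item, encoding) for item in obj]
    if isinstance(obj, text_type):
        return obj.encode(encoding)
    return obj

nested = json.loads('{"name": "spam", "items": ["a", "b"]}')
flat = byteify(nested)
```

This walks dicts, lists, and scalars, so arbitrarily nested structures come out converted in one call — still a post-parse pass, but at least not a manual one.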


That's because JSON makes no distinction between string objects and unicode objects. They're all strings in javascript.

I think JSON is right to return unicode objects. In fact, I wouldn't accept anything less, since javascript strings are in fact unicode objects (i.e. JSON (javascript) strings can store any kind of unicode character) so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn't fit since the library would have to guess the encoding you want.

It's better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with unicode objects.

But if you really want bytestrings, just encode the results to the encoding of your choice:

>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']
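A partial shortcut for nested data, if all the strings in your documents live inside JSON objects: `loads` (in both the stdlib `json` and simplejson) accepts an `object_hook` callable that is applied to every decoded dict, innermost first, so the encoding can happen during parsing. A sketch (the helper name is my own; strings in a bare top-level array are not covered by the hook):

```python
import json

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3

def encode_dict(d, encoding='utf-8'):
    # Called by the decoder for every JSON object, innermost first,
    # so nested dicts arrive here already converted.
    out = {}
    for key, value in d.items():
        if isinstance(key, text_type):
            key = key.encode(encoding)
        if isinstance(value, text_type):
            value = value.encode(encoding)
        elif isinstance(value, list):
            value = [v.encode(encoding) if isinstance(v, text_type) else v
                     for v in value]
        out[key] = value
    return out

data = json.loads('{"a": "x", "b": ["y", "z"]}', object_hook=encode_dict)
```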
what on earth has Java to do here? – Javier Jun 5 at 17:17
Thanks nosklo, that's what I did first. But as I said, the real data I use is pretty nested, so this introduced quite some overhead. I'm still looking for an automatic solution... There's at least one bug report out there where people complain about simplejson returning string objects instead of unicode. – Brutus Jun 5 at 17:23
@Javier: Sorry, I meant Javascript. Fixed the text in the answer. – nosklo Jun 5 at 18:12
@Brutus: I think json is right to return unicode objects. In fact, I wouldn't accept anything less, since javascript strings are in fact unicode objects. What I mean is that json (javascript) strings can store any kind of unicode character, so it makes sense to create unicode objects when translating from json. You should really fix your libraries instead. – nosklo Jun 5 at 18:27
