
I'm using Python (Python 2.5.2 on Ubuntu 8.10) to parse JSON from (ASCII encoded) text files. When loading these files with json (simplejson), all my string values are cast to Unicode objects instead of string objects.

The problem is, I have to use the data with some libraries that only accept string objects.

Is it possible to get string objects instead of unicode ones from simplejson?
Any hints on how I can achieve this automatically?

Edit: I can't change the libraries nor update them. One - the csv module - is even in the Python standard library (the documentation says it will support Unicode in the future). I could write wrappers of course, but maybe there is a more convenient way?

The actual data I parse from the JSON files is rather nested and complex, so it would be a pain to look for every Unicode object therein and cast it manually...

Here's a small example:

>>> import simplejson as json
>>> l = ['a', 'b']
>>> l
['a', 'b']
>>> js = json.dumps(l)
>>> js
'["a", "b"]'
>>> nl = json.loads(js)
>>> nl
[u'a', u'b']

Update: I completely agree with Jarret Hardie and nosklo: since the JSON spec specifically defines strings as Unicode, simplejson should return Unicode objects.

But while searching the net, I came across some posts where people complained about simplejson actually returning string objects... I couldn't reproduce this behavior, but it seems to be possible. Any hints?

Workaround

Right now I use PyYAML to parse the files; it gives me string objects.
Since JSON is a subset of YAML, this works nicely.

By the way, why don't you use unicode internally? :) – NicDumZ Jun 6 at 11:03
@NicDumZ: I would love to ;) But as I said, the libraries I use don't support Unicode. Working around Unicode sadly is more convenient (in this case) than replacing or wrapping the libraries. – Brutus Jun 6 at 11:20

2 Answers


I'm afraid there's no way to achieve this automatically within the simplejson library.

The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You'd have to either monkeypatch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.

The reason that simplejson outputs unicode, however, is that the json spec specifically mentions that "A string is a collection of zero or more Unicode characters"... support for unicode is assumed as part of the format itself. Simplejson's scanstring implementation goes so far as to scan and interpret unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.
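A quick illustration of why: an escaped character in the JSON input can only be represented faithfully as a unicode text object once the decoder has interpreted it (shown here with the stdlib `json` module, which shares simplejson's decoder):

```python
import json

# The decoder interprets the \u00e9 escape sequence, so the only
# faithful representation of the result is a unicode text object:
# 'caf' followed by U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
decoded = json.loads('"caf\\u00e9"')
```

There is no byte string the decoder could hand back here without first guessing an output encoding on your behalf.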

If you have an aged library that needs an str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid... sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.
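If you do go the post-processing route, the search can at least be automated with a small recursive helper. A sketch (the function name is my own), written so it runs on both Python 2 and 3 — on Python 2 it turns every decoded unicode value into an encoded str:

```python
import json

try:
    text_type = unicode  # Python 2: decoded JSON strings are unicode
except NameError:
    text_type = str      # Python 3: there is only str

def byteify(obj, encoding='utf-8'):
    """Recursively encode every text string in a parsed JSON structure."""
    if isinstance(obj, dict):
        return dict((byteify(k, encoding), byteify(v, encoding))
                    for k, v in obj.items())
    if isinstance(obj, list):
        return [byteify(item, encoding) for item in obj]
    if isinstance(obj, text_type):
        return obj.encode(encoding)
    return obj

nested = json.loads('{"name": "spam", "items": ["a", "b"]}')
flat = byteify(nested)
```

This walks dicts, lists, and scalars, so arbitrarily nested structures come out converted in one call — still a post-parse pass, but at least not a manual one.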


That's because JSON makes no distinction between string objects and unicode objects. They're all strings in javascript.

I think JSON is right to return unicode objects. In fact, I wouldn't accept anything less, since javascript strings are in fact unicode objects (i.e. JSON (javascript) strings can store any kind of unicode character) so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn't fit since the library would have to guess the encoding you want.

It's better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with unicode objects.

But if you really want bytestrings, just encode the results to the encoding of your choice:

>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']
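A partial shortcut for nested data, if all the strings in your documents live inside JSON objects: `loads` (in both the stdlib `json` and simplejson) accepts an `object_hook` callable that is applied to every decoded dict, innermost first, so the encoding can happen during parsing. A sketch (the helper name is my own; strings in a bare top-level array are not covered by the hook):

```python
import json

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3

def encode_dict(d, encoding='utf-8'):
    # Called by the decoder for every JSON object, innermost first,
    # so nested dicts arrive here already converted.
    out = {}
    for key, value in d.items():
        if isinstance(key, text_type):
            key = key.encode(encoding)
        if isinstance(value, text_type):
            value = value.encode(encoding)
        elif isinstance(value, list):
            value = [v.encode(encoding) if isinstance(v, text_type) else v
                     for v in value]
        out[key] = value
    return out

data = json.loads('{"a": "x", "b": ["y", "z"]}', object_hook=encode_dict)
```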
what on earth has Java to do here? – Javier Jun 5 at 17:17
Thanks nosklo, that's what I did first. But as I said, the real data I use is pretty nested, so this introduced quite some overhead. I'm still looking for an automatic solution... There's at least one bug report out there where people complain about simplejson returning string objects instead of unicode. – Brutus Jun 5 at 17:23
@Javier: Sorry, I meant Javascript. Fixed the text in the answer. – nosklo Jun 5 at 18:12
@Brutus: I think json is right to return unicode objects. In fact, I wouldn't accept anything less, since javascript strings are in fact unicode objects. What I mean is that json (javascript) strings can store any kind of unicode character, so it makes sense to create unicode objects when translating from json. You should really fix your libraries instead. – nosklo Jun 5 at 18:27
