I see that the Python manual mentions the .encode() and .decode() string methods. Playing around on the Python CLI, I see that I can create unicode strings such as u'hello', which have a different datatype than a 'regular' string 'hello', and that I can convert / cast with str(). But the real problems start when using characters outside the ASCII range (code points above 127), such as u'שלום', and I am having a hard time determining empirically exactly what is happening.

Stack Overflow is overflowing with examples of confusion regarding Python's unicode and string-encoding/decoding handling.

What exactly happens (how are the bytes changed, and how is the datatype changed) when encoding and decoding strings with the str() method, especially when characters that cannot be represented in 7 bits are included in the string? Is it true, as it seems, that a Python variable with datatype <type 'str'> can be both encoded and decoded? If it is encoded, I understand that means that the string is represented by UTF-8, ISO-8859-1, or some other encoding. Is this correct? If it is decoded, what does this mean? Are decoded strings unicode? If so, then why don't they have the datatype <type 'unicode'>?

In the interest of those who will read this later, I think that both Python 2 and Python 3 should be addressed. Thank you!

Python 3 does not have any of these issues: str can only be encoded and bytes can only be decoded. – R. Martinho Fernandes 16 hours ago

2 Answers

Accepted answer:

A str that can be both encoded and decoded exists only in Python 2. The existence of an encode method on Python 2's byte strings (alongside decode) is a wart, which has been changed in Python 3 (where the equivalent type, bytes, has only decode).

You can't 'encode' an already-encoded string. What happens when you do call encode on a str is that Python implicitly calls decode on it using the default encoding, which is usually ASCII. This is almost always not what you want. You should always call decode to convert a str to unicode before converting it to a different encoding.
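
The implicit step described above can be imitated in Python 3, where nothing happens behind your back. This is only a sketch: py2_style_encode is a hypothetical helper written for illustration, not a real Python function, and it assumes the default encoding is ASCII, as it usually was in Python 2.

```python
# Python 3 sketch of what Python 2 did implicitly when encode() was called on a str.
raw = 'שלום'.encode('utf-8')  # bytes as they might arrive from a file or socket


def py2_style_encode(byte_string, target_encoding, default_encoding='ascii'):
    """Hypothetical helper: mimic Python 2's str.encode() on a byte string."""
    # Step 1 (implicit in Python 2): decode using the default encoding.
    text = byte_string.decode(default_encoding)
    # Step 2: encode the resulting text to the requested encoding.
    return text.encode(target_encoding)


try:
    py2_style_encode(raw, 'utf-8')
    implicit_result = 'ok'
except UnicodeDecodeError:
    # The hidden ASCII decode fails on any byte above 127.
    implicit_result = 'UnicodeDecodeError'

# The explicit route: decode with the encoding the bytes really use, then encode.
explicit_result = raw.decode('utf-8').encode('utf-16-le')

print(implicit_result)
print(explicit_result)
```

With pure-ASCII input the hidden decode succeeds, which is why the problem tends to show up only once non-ASCII data arrives.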

(And decoded strings are unicode, and they do have type <type 'unicode'>, so I don't know what you mean by that question.)

In Python 3 of course strings are unicode by default. You can only encode them to bytes - which, as I mention above, can only be decoded.
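
A quick way to check this in a Python 3 interpreter is to ask each type which of the two methods it has. The variable names here are only illustrative.

```python
# Python 3: str offers encode() only, bytes offers decode() only.
text = 'שלום'                 # str: a sequence of Unicode code points
data = text.encode('utf-8')   # bytes: the UTF-8 representation of that text

str_can_encode = hasattr(text, 'encode')
str_can_decode = hasattr(text, 'decode')
bytes_can_encode = hasattr(data, 'encode')
bytes_can_decode = hasattr(data, 'decode')

print(type(text).__name__, str_can_encode, str_can_decode)
print(type(data).__name__, bytes_can_encode, bytes_can_decode)
```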

Thank you Daniel. I think that I may be best off porting to Python 3 and being done with it. I find the implicit decoding done in Python to be not only very 'unpythonic' (explicit is better than implicit) but also very confusing as the developer is not aware that such a conversion has taken place. Plus, it is decoding using the wrong encoding! – dotancohen 16 hours ago

Let me just highlight some points that are important to understand when working with natural languages in programming:

  • Computers, devices and networks have no idea of "characters".
  • A "character" is something only we humans are able to understand.
  • All you get from or put to a computer or a network is a sequence of bytes.
  • What exactly these bytes represent is up to you and must be defined in your code.
  • A "string" is a mathematical abstraction; there is no way computers can store "strings" directly.
  • A sequence of bytes can represent a "string", among other things.
  • The same string can be represented in different ways.
  • These ways are called "encodings".
  • In Python, to convert bytes to a string you use bytes.decode(name-of-encoding).
  • To convert a string to a sequence of bytes you use string.encode(name-of-encoding).
  • The "bytes" type is called str in Python 2 and bytes in Python 3.
  • The "string" type is called unicode in Python 2 and str in Python 3.
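
To make the points above concrete, here is a small Python 3 sketch. The choice of UTF-8 and ISO-8859-8 is just an example; any encoding that covers Hebrew would illustrate the same thing.

```python
# Python 3: one string, two different byte representations.
text = 'שלום'

as_utf8 = text.encode('utf-8')          # two bytes per Hebrew letter
as_hebrew = text.encode('iso-8859-8')   # one byte per Hebrew letter

print(len(text), len(as_utf8), len(as_hebrew))

# Decoding with the encoding that produced the bytes gives the string back.
round_trip_utf8 = as_utf8.decode('utf-8')
round_trip_hebrew = as_hebrew.decode('iso-8859-8')

# Decoding with a different encoding does not fail here, because ISO-8859-1
# maps every byte to some character, but the result is a different string.
misread = as_utf8.decode('iso-8859-1')

print(round_trip_utf8 == text, round_trip_hebrew == text, misread == text)
```

The bytes themselves carry no label saying which encoding they use, so the code that calls decode has to know it.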

Two must-read resources:

Thank you. I've read Joel's article as recently as yesterday! I just read Ned's article a few minutes ago. Your bullet points are very helpful, especially the last two! – dotancohen 15 hours ago
@dotancohen: irony? Did my post offend you? If yes, how exactly? – thg435 15 hours ago
No irony! I'm not offended! Why do you think that? I feel that your post was genuinely very helpful. Thank you! – dotancohen 13 hours ago
