Encoding error using Python

Question

I wrote some code that connects to an IMAP server, parses the message body, and inserts it into a database. But I am having problems with accented characters.

From the email header I got this information:

Content-Type: text/html; charset=ISO-8859-1

But I am not sure whether I can trust this information.

The email was written in Portuguese, so it contains a lot of accented words. For example, I extracted the following phrase from the email source (using my browser):

"...instalação de eletrônicos..."

So, I connected to the IMAP server and fetched some emails:

... typ, data = M.fetch(num, '(RFC822)') ...

When I print the content, I get the following word:

print data[0][1]

instala+º+úo de eletr+¦nicos

I tried to use .decode('utf-8'), but without success:

instalaÃ§Ã£o de eletrÃ´nicos

How can I make it human readable? My database is in UTF-8.

What does print(type(data[0][1])); print(repr(data[0][1])) print? — Martijn Pieters♦, Feb 11 '13 at 18:18
@MartijnPieters - type: <type 'str'> and "print(repr(" returned accents with the following format: fun\xc3\xa7\xc3\xa3o (sorry, this is another accented word) — Thomas, Feb 11 '13 at 18:27
No, that's exactly what I wanted to see. That's função in UTF8. And .decode('utf8') should work, perhaps you need to show us more code? — Martijn Pieters♦, Feb 11 '13 at 18:27
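The repr quoted in the comments can be checked directly. The following is a minimal sketch in Python 3 syntax (the thread itself uses Python 2, where the same byte values apply). It shows that the reported bytes are valid UTF-8, and that decoding them as ISO-8859-1 produces the doubled "Ã§Ã£" characters seen in the question.

```python
# Bytes reported by repr() in the comment thread: "função" encoded as UTF-8.
raw = b"fun\xc3\xa7\xc3\xa3o"

# Decoding with the matching codec recovers the six-character word.
correct = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 turns every byte into its own
# character, so each accented letter becomes two characters.
mojibake = raw.decode("iso-8859-1")

# The damage is reversible as long as no bytes were lost along the way.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")

print(len(correct), len(mojibake))  # 6 8
print(repaired == correct)          # True
```

If the decoded text looks right in repr() but wrong when printed, the terminal encoding is a likely suspect rather than the data.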

Thomas · Answer 1 · 2013-02-12 02:14:03Z

Thanks to Martijn Pieters. We figured out that the email had parts in two different encodings. I had to split the parts and decode each one individually.
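One way to treat each part individually is to let the standard library email package split the message and report each part's declared charset. This is only a sketch in Python 3 syntax, not the code actually used here: decode_text_parts is a hypothetical helper name, the sample message is made up for illustration, and the UTF-8 fallback for parts without a declared charset is an assumption. In the question, the raw message would be data[0][1].

```python
import email


def decode_text_parts(raw_message):
    """Return the text of every text/* part, decoded with its own charset."""
    msg = email.message_from_bytes(raw_message)
    texts = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        # Assumption: fall back to UTF-8 when the part declares no charset.
        charset = part.get_content_charset() or "utf-8"
        texts.append(payload.decode(charset, errors="replace"))
    return texts


# Made-up message with two parts in two different encodings.
raw_message = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="BOUND"\r\n'
    b"\r\n"
    b"--BOUND\r\n"
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"instala\xe7\xe3o\r\n"
    b"--BOUND\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"<p>eletr\xc3\xb4nicos</p>\r\n"
    b"--BOUND--\r\n"
)

parts = decode_text_parts(raw_message)
print(len(parts))
```

Decoding with errors="replace" keeps the import running when a part lies about its charset, at the cost of replacement characters in the stored text.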

Roy Nieterau · Answer 2 · 2013-02-11 20:59:32Z

Specifying the source code encoding worked for me. It's the coding comment at the top of my example code below, and it should be defined at the top of your Python file.

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-

value = """...instalação de eletrônicos...""".decode("iso-8859-15")
print value
# prints: ...instalação de eletrônicos...

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii','ignore')
print value
# prints: ...instalacao de eletronicos...

And now you can do str(value) without an exception as well.

See: http://docs.python.org/2/library/unicodedata.html

This seems to keep all accents:

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import unicodedata
value = """...instalação de eletrônicos...""".decode("iso-8859-15")
value = unicodedata.normalize('NFKC', value).encode('utf-8')
print value
print str(value)

# prints (without exceptions/errors):
# ...instalação de eletrônicos...
# ...instalação de eletrônicos...

EDIT:

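Do note that with the last version, even though the output looks the same, comparing the two values with == does not return True (inValue is a unicode object, while normalizedValue is a UTF-8 encoded byte string). For example: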

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import unicodedata
inValue = """...instalação de eletrônicos...""".decode("iso-8859-15")
normalizedValue = unicodedata.normalize('NFKC', inValue).encode('utf-8')

try:
    print inValue == normalizedValue
except UnicodeWarning:
    pass
# False

EDIT2:

This returns the same:

normalizedValue = unicode("""...instalação de eletrônicos...""".decode("iso-8859-15")).encode('utf-8')
print normalizedValue
print str(normalizedValue)

# prints (without exceptions/errors):
# ...instalação de eletrônicos...
# ...instalação de eletrônicos...

Though I'm not sure this will actually be valid for a utf-8 encoded database. Probably not?
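On the database question, a quick check suggests the re-encoded bytes are ordinary UTF-8. The sketch below uses Python 3 syntax and starts from bytes built in the script itself, standing in for data read from an ISO-8859-15 source. Whether the database driver wants text or encoded bytes depends on the driver, which isn't specified here.

```python
# Stand-in for bytes arriving from an ISO-8859-15 source.
source_bytes = "...instalação de eletrônicos...".encode("iso-8859-15")

# Decode with the source codec, then encode for a UTF-8 destination.
text = source_bytes.decode("iso-8859-15")
utf8_bytes = text.encode("utf-8")

# Valid UTF-8 decodes back to exactly the same text.
round_trip_ok = utf8_bytes.decode("utf-8") == text

# Each accented letter takes one byte in ISO-8859-15 and two in UTF-8.
print(round_trip_ok, len(source_bytes), len(utf8_bytes))
```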

marianobianchi · Answer 3 · 2013-02-11 18:23:43Z

The header says it is using "ISO-8859-1" charset. So you need to decode the string with that encoding.

Try this:

data[0][1].decode('iso-8859-1')

That would not lead to the double bytes seen by the OP. Let's see what my request for the type and repr of the data gives us, shall we? – Martijn Pieters♦ Feb 11 '13 at 18:26

It returned the following error: UnicodeEncodeError: 'charmap' codec can't encode character u'\x83' in position 20: character maps to <undefined> – Thomas Feb 11 '13 at 18:29
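The error in the last comment is a UnicodeEncodeError, which suggests it was raised while printing the decoded text, not by the decode call: every byte value is valid ISO-8859-1, so that decode cannot fail. A sketch in Python 3 syntax, assuming a cp850 console (a guess, the thread doesn't say which console encoding was in use):

```python
# Every byte value is valid ISO-8859-1, so this decode always succeeds.
text = b"\xc3\x83".decode("iso-8859-1")  # two characters: U+00C3 and U+0083

# Assumption: cp850 stands in for the console encoding used when printing.
console_codec = "cp850"

try:
    text.encode(console_codec)
    encode_failed = False
except UnicodeEncodeError as exc:
    # U+0083 has no equivalent in cp850, so encoding for output fails.
    encode_failed = True
    print(exc.reason)

# Replacing unrepresentable characters avoids the exception when printing.
safe = text.encode(console_codec, errors="replace")
print(encode_failed, safe.endswith(b"?"))  # True True
```

This only hides the symptom on output. The text written to the database should still come from decoding each part with its real charset.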
