I wrote some code that connects to an IMAP server, parses the message body, and inserts it into a database, but I am having problems with accented characters.

The email header contains this information:

Content-Type: text/html; charset=ISO-8859-1

But I am not sure I can trust this information.

The email was written in Portuguese, so it contains many accented words. For example, I copied the following phrase from the email source (viewed in my browser):

"...instalação de eletrônicos..."

I connected to the IMAP server and fetched some emails:

... typ, data = M.fetch(num, '(RFC822)') ...

When I print the content, I get the following word:

print data[0][1]
instala+º+úo de eletr+¦nicos

I tried to use .decode('utf-8') but I had no success.

instalação de eletrônicos

How can I make it human-readable? My database uses UTF-8.

Python2 or Python3? –  Winston Ewert Feb 11 '13 at 18:18
What does print(type(data[0][1])); print(repr(data[0][1])) print? –  Martijn Pieters Feb 11 '13 at 18:18
    
@WinstonEwert - Python 2.7 –  Thomas Feb 11 '13 at 18:25
    
@MartijnPieters - type: <type 'str'> and "print(repr(" returned accents with the following format: fun\xc3\xa7\xc3\xa3o (sorry, this is another accented word) –  Thomas Feb 11 '13 at 18:27
No, that's exactly what I wanted to see. That's função in UTF8. And .decode('utf8') should work, perhaps you need to show us more code? –  Martijn Pieters Feb 11 '13 at 18:27

3 Answers 3

Thanks to Martijn Pieters, we figured out that the email contained parts in two different encodings. I had to split the parts and decode each one individually.
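A rough sketch of what "decode each part individually" could look like, using the standard library's email package. It is written for Python 3, where the IMAP fetch returns bytes; on Python 2.7 the analogous entry point would be email.message_from_string. The helper name decode_text_parts and the UTF-8 fallback are assumptions for illustration, not something taken from the original code.

```python
import email


def decode_text_parts(raw_bytes, fallback="utf-8"):
    """Return (content_type, text) for every text part of a raw message.

    Hypothetical helper: each part is decoded with the charset declared
    in its own Content-Type header, not with one charset for the whole
    message.
    """
    msg = email.message_from_bytes(raw_bytes)
    decoded = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and non-text attachments
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or fallback
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # The declared charset is unknown to Python; use the fallback.
            text = payload.decode(fallback, errors="replace")
        decoded.append((part.get_content_type(), text))
    return decoded
```

With the fetch from the question, this would be called as decode_text_parts(data[0][1]). Every returned string is already text, so it can be encoded as UTF-8 once, right before it goes into the database.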

Specifying the source code encoding worked for me. It is the comment on the second line of the example below, and it must appear on the first or second line of your Python file.

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-

value = """...instalação de eletrônicos...""".decode("iso-8859-15")
print value
# prints: ...instalação de eletrônicos...

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii','ignore')
print value
# prints: ...instalacao de eletronicos...

And now you can do str(value) without an exception as well.

See: http://docs.python.org/2/library/unicodedata.html

This seems to keep all accents:

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import unicodedata
value = """...instalação de eletrônicos...""".decode("iso-8859-15")
value = unicodedata.normalize('NFKC', value).encode('utf-8')
print value
print str(value)

# prints (without exceptions/errors):
# ...instalação de eletrônicos...
# ...instalação de eletrônicos...

EDIT:

Note that with the last version, even though the output looks the same, comparing the two values with == does not return True, because one is a unicode object and the other is a UTF-8 encoded byte string. For example:

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import unicodedata
inValue = """...instalação de eletrônicos...""".decode("iso-8859-15")
normalizedValue = unicodedata.normalize('NFKC', inValue).encode('utf-8')

try:
    print inValue == normalizedValue
except UnicodeWarning:
    pass
# False

EDIT2:

This returns the same:

normalizedValue = """...instalação de eletrônicos...""".decode("iso-8859-15").encode('utf-8')
print normalizedValue
print str(normalizedValue)

# prints (without exceptions/errors):
# ...instalação de eletrônicos...
# ...instalação de eletrônicos...

Though I'm not sure this will actually be valid for a utf-8 encoded database. Probably not?
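One way to check that doubt is to decode the result again as UTF-8 and compare it with the original text. The sketch below does that in Python 3; the helper name latin9_to_utf8 is made up for illustration, and it normalizes with NFC rather than NFKC, which is an assumption on this sketch's part.

```python
import unicodedata

# The sample phrase ("...instalação de eletrônicos...") written with
# escapes so the snippet does not depend on the file's own encoding.
text = "...instala\u00e7\u00e3o de eletr\u00f4nicos..."
latin_bytes = text.encode("iso-8859-15")  # simulated ISO-8859-15 input


def latin9_to_utf8(data):
    """Decode ISO-8859-15 bytes and re-encode them as UTF-8 bytes."""
    decoded = data.decode("iso-8859-15")
    return unicodedata.normalize("NFC", decoded).encode("utf-8")


utf8_bytes = latin9_to_utf8(latin_bytes)

# If the bytes are well-formed UTF-8, decoding succeeds and gives back
# the original text.
roundtrip_ok = utf8_bytes.decode("utf-8") == text
```

If roundtrip_ok is True, the bytes are well-formed UTF-8, which is what a UTF-8 database column expects. The comparison is made between two text values, not between text and bytes.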

The header says it is using "ISO-8859-1" charset. So you need to decode the string with that encoding.

Try this:

data[0][1].decode('iso-8859-1')
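Whether this works depends on the bytes the server actually returned, not only on the header. A small Python 3 sketch, using the byte sequence reported in the comments under the question (fun\xc3\xa7\xc3\xa3o), shows the difference between the two guesses. The helper looks_like_utf8 is hypothetical and only a heuristic.

```python
# Bytes as reported by repr() in the comments under the question.
raw = b"fun\xc3\xa7\xc3\xa3o"

as_utf8 = raw.decode("utf-8")         # each accented letter becomes one character
as_latin1 = raw.decode("iso-8859-1")  # never raises, but maps every byte to a character

print(ascii(as_utf8))    # 'fun\xe7\xe3o'
print(ascii(as_latin1))  # 'fun\xc3\xa7\xc3\xa3o'


def looks_like_utf8(data):
    """Heuristic: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

ISO-8859-1 assigns a character to every byte value, so decoding with it cannot fail, and a wrong guess shows up only as doubled characters such as "Ã§" in place of "ç". A strict UTF-8 decode, in contrast, rejects most ISO-8859-1 text that contains accents, which makes it a more informative first test.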
    
That would not lead to the double bytes seen by the OP. Let's see what my request for the type and repr of the data gives us, shall we? –  Martijn Pieters Feb 11 '13 at 18:26
    
It returned the following error: UnicodeEncodeError: 'charmap' codec can't encode character u'\x83' in position 20: character maps to <undefined> –  Thomas Feb 11 '13 at 18:29
