
I was trying to open a file on Ubuntu in Python using:

open('<unicode_string>', "wb")

unicode_string is '\u9879\u76ee\u7ba1\u7406', which is Chinese text. But I get the following error:

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)

I am trying to understand what limits this behavior. I went through some links, and my understanding is that the OS's filesystem is responsible for encoding the string. On Windows, the 'mbcs' encoding possibly handles every character. What could be the problem on Linux?

  • It does not fail on all Linux setups. What should I be checking?
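As a starting point for "what should I be checking", here is a minimal sketch, assuming Python 3, that prints the settings Python consults when it turns a Unicode file name into bytes. The helper name describe_path_encoding is made up for illustration; sys.getfilesystemencoding() and locale.getpreferredencoding() are standard library calls.

```python
import locale
import os
import sys


def describe_path_encoding():
    """Collect the settings that influence how Python encodes file names.

    Hypothetical helper for illustration; not part of any library.
    """
    return {
        # Encoding Python uses to convert str paths to bytes for the OS.
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Encoding derived from the current locale settings.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Environment variables that select the locale (None if unset).
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
    }


if __name__ == "__main__":
    for key, value in describe_path_encoding().items():
        print(key, "=", value)
```

Comparing this output between a machine where the open() call works and one where it fails should show whether the two differ in locale or in the filesystem encoding Python picked.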
    
What is your locale (run the locale command and show its output)? In particular, what are the LANG and LC_ALL environment variables set to? –  Basile Starynkevitch Mar 23 at 20:53
    
I am just trying to understand how the locale would affect this behavior. Can you please shed some light? Is the file system trying to encode according to the locale settings? –  Barun Sharma Mar 24 at 3:26
    
The error you get is from the Python layer. The open(2) syscall does not care about encoding. –  Basile Starynkevitch Mar 24 at 6:03
mbcs does not handle every character; it would limit you to the characters in the ANSI code page for that machine. However, Python has special support for accepting pathnames as Unicode strings and passing those directly to Win32-specific APIs instead of using the C standard library calls with mbcs encoding. –  bobince Mar 24 at 11:41
On non-Windows platforms, Unicode pathnames have to be converted to byte strings in the Python layer. Python uses the machine's locale information to try to work out what encoding is used for file system paths. On a modern Linux system that encoding should be UTF-8 to make everyone happy, but something about your environment is making Python think you are using Latin-1 for your filesystem, in which case storing U+9879 et al is indeed impossible. –  bobince Mar 24 at 11:44
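To illustrate the point about Latin-1, here is a rough sketch, assuming Python 3 on a POSIX system. It shows that the file name from the question cannot be encoded as Latin-1 but can be encoded as UTF-8, and one possible workaround: encoding the name to UTF-8 bytes yourself so that Python's guess about the filesystem encoding is not used for it. The helper open_with_utf8_name is hypothetical, and fixing the locale is likely the cleaner solution.

```python
import os

name = '\u9879\u76ee\u7ba1\u7406'

# Latin-1 only covers code points below 256, so encoding this name fails
# in the same way as the open() call in the question.
latin1_error = None
try:
    name.encode('latin-1')
except UnicodeEncodeError as exc:
    latin1_error = str(exc)

# UTF-8 can represent every Unicode code point.
utf8_name = name.encode('utf-8')


def open_with_utf8_name(directory, filename):
    """Open directory/filename for binary writing, forcing a UTF-8 file name.

    Hypothetical helper: passing a bytes path means Python does not apply
    the locale-derived filesystem encoding to the file name.
    """
    path = os.path.join(os.fsencode(directory), filename.encode('utf-8'))
    return open(path, 'wb')
```

For example, open_with_utf8_name('/tmp', name) would create the file with a UTF-8 encoded name regardless of what sys.getfilesystemencoding() reports.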
