I currently put this at the top of all my .py files:

# -*- coding: utf-8 -*-

I've been taught this for years as best practice. To me the idea of enforcing UTF-8 by default makes sense, especially with my tests containing a lot of Unicode characters. It allows me to write Unicode literals in my code directly.

However, I recently was told that forcing the source encoding to UTF-8 can be bad for cross-platform compatibility, since Windows doesn't default to UTF-8. I believe it's not just an issue with code editors, but more of an issue with treating Unicode the same everywhere. But I don't fully understand the issue.

Both approaches seem to have strong arguments. In more detail, what are the benefits of enforcing/not enforcing a source encoding? What are the problems?

The problem you may encounter is that a text editor decides the encoding should be cp1252 and saves your file using that encoding. But any proper Python editor should read the coding header. –  njzk2 20 hours ago

2 Answers

I'm not sure I know exactly what compatibility issues you mean, but you seem to be conflating two separate issues. One thing is: when you actually type characters into your source file, they are encoded using a certain encoding, which is determined by your text editor and/or operating system settings. Another thing is: when Python reads your source file, it interprets what it finds according to a certain encoding, and that is what your # -*- coding: utf-8 -*- declaration tells it.

Just because you write # -*- coding: utf-8 -*- at the top of your file doesn't mean your file actually is in UTF-8. That encoding declaration does not "enforce" anything; it just tells Python to assume that the file is in UTF-8.

As a parallel, imagine receiving a document that says at the top "This document is written in Croatian". Upon reading this, you might go get a Croatian dictionary to help you understand the document. However, just because it says that at the top doesn't mean the document actually is in Croatian; anyone can take a document written in Albanian or some other language and write "This document is written in Croatian" at the top, and in fact, they might do so, if they were unfamiliar with both languages and didn't know how to tell the difference.

Similarly, if you use a text editor that isn't Unicode-aware, it may blithely write bytes that are not valid UTF-8 into the file, even though you wrote "coding: utf-8" at the top. This will cause problems if you later try to run the file, because Python will try to decode it as UTF-8 even though it really isn't.
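
To make that mismatch concrete, here is a minimal sketch. It assumes an editor that saved the file as cp1252 despite the declaration; the sample source text and the variable names are made up for illustration. tokenize.detect_encoding is the standard-library helper that reads the declaration from the first lines of a file.

```python
import io
import tokenize

# Hypothetical source text with a UTF-8 declaration and one non-ASCII character.
source_text = '# -*- coding: utf-8 -*-\nname = "café"\n'

# Assumption: the editor ignored the declaration and saved the file as cp1252.
raw_bytes = source_text.encode("cp1252")

# Read the declared encoding from the raw bytes.
declared, _ = tokenize.detect_encoding(io.BytesIO(raw_bytes).readline)

# The declaration says UTF-8, but the bytes on disk are not valid UTF-8.
try:
    raw_bytes.decode(declared)
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(declared, decoded_ok)
```

The declaration is still found and still says UTF-8; it is the decoding step that fails, which is roughly what happens when you try to run a file in that state.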

UTF-8 is still the best encoding to use. The only thing is you should make sure that your editor is set up so it really is encoding your files in UTF-8.

It's also possible that if someone else gets your code and makes modifications, they could be using an editor that's not using UTF-8, which would likewise cause problems if their editor put non-UTF-8 stuff into the file. This means that if you're sharing code with other people (e.g., you're part of a team developing software), you should all agree on an encoding and use it consistently. It is conceivable that you could be part of an organization that has a policy of using some encoding other than UTF-8 (say, Latin-1), in which case you'd have to set your editor to use that encoding. However, more and more, organizations big enough to share code extensively among different people are realizing that everyone should always be using UTF-8 all the time.

(Someone who downloads your code off the internet and tries to modify it can also run into the same encoding problems, but if your file is in UTF-8 and has the UTF-8 encoding declaration, then it's self documenting. If someone else messes it up with another encoding, that's their own fault for not paying attention. You only need to worry about such problems insofar as you actually care about collaborating with others; you can't and shouldn't worry about the myriad mistakes that random people on the internet might make if they come across your code.)


Many code editors will not understand your coding declaration. And, on Windows, many of them will default to your configured code page instead of UTF-8. Worse, if you edit the mojibake'd code and save it, it'll get double-mojibake'd, and it'll be horribly misleading: you'll have CP1252 text that claims to be UTF-8.

So, that's bad.
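
Here is a small sketch of how that mojibake comes about, assuming the code page in question is cp1252 and using a made-up sample string. Each round of "read as cp1252, save as UTF-8" makes the text worse.

```python
original = "café"

# The file on disk really is UTF-8.
utf8_bytes = original.encode("utf-8")

# An editor that assumes cp1252 shows each UTF-8 byte as its own character.
once = utf8_bytes.decode("cp1252")  # 'cafÃ©'

# Saving that view as UTF-8 and misreading it again doubles the damage.
twice = once.encode("utf-8").decode("cp1252")  # 'cafÃƒÂ©'

# A single round can be undone by reversing the two steps.
repaired = once.encode("cp1252").decode("utf-8")

print(once, twice, repaired)
```

In this example the damage can be undone by reversing the steps, but only if you know which encodings were involved and in what order.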

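But leaving off the coding declaration just makes things worse. Then, even the better editors (that do read coding declarations) will get your code wrong. And, worse, the Python interpreter will get your code wrong: Python 2 assumes ASCII when there is no declaration and rejects non-ASCII bytes in the source (Python 3 assumes UTF-8 instead).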

You can, of course, write all of your code (including any string literals) in nothing but ASCII, using Unicode escape sequences when necessary. The upside is that you can avoid all encoding-related issues with your source code (well, as long as you stick to ASCII-compatible encodings, but since current versions of Python don't even run on any EBCDIC machines or ZX81s or whatever, you can probably ignore that). The downside is that it can be a lot less readable for some kinds of code (e.g., code whose main job is to build text out of mail-merge templates in Czech won't be pretty if those templates are written as string literals with Unicode escapes).
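
As a rough illustration of the ASCII-only approach (the Czech sample word is my own choice), the source below contains nothing but ASCII, yet the string it builds is the same as one decoded from UTF-8 bytes. The built-in ascii() is one way to get the escaped form of an existing string.

```python
# ASCII-only source: the Czech word for "Czech language", spelled with escapes.
escaped = "\u010de\u0161tina"

# The same word decoded from its UTF-8 bytes, for comparison.
from_bytes = b"\xc4\x8de\xc5\xa1tina".decode("utf-8")

# Both spellings produce the identical seven-character string.
same = escaped == from_bytes

# ascii() returns a repr that uses escapes for every non-ASCII character.
escaped_repr = ascii(from_bytes)

print(same, escaped_repr)
```

The readability cost is visible even in one word, which is the trade-off described above.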

Anyway, if you stick to all ASCII, then yes, adding a coding declaration is probably a bad idea (because it may mislead you or other maintainers into thinking they can safely insert non-ASCII characters, which you were deliberately avoiding doing). But otherwise, it's absolutely necessary.
