Encoding problem #332
Comments
Thanks for posting this issue! It seems that the TagObject's information can't be decoded because it contains bytes that are not valid UTF-8, which is unexpected. Even though you have already discovered a workaround, the original problem remains. A proper fix would re-evaluate the current code and prefer to work on bytes instead of a decoded string. |
In fact, no, |
More data: technically this is an encoding error in the data itself:
In [11]: stream = pricing.odb.stream(tag.object.binsha)
In [12]: stream.read()
Out[12]: 'object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\ntype commit\ntag PROMOTED_1501131729_MKT15_01_12_QU_1\ntagger \xa8John Doe <[email protected]> 1421167136 +0100\n\nMerged CIU_MKT1501_28 to remote master\n'
You can see the offending character just before the tagger's name (technically being part of it). On the other hand, I don't even know whether git handles this, but what happens when different objects are encoded with different encodings? I'm pretty sure git objects do not store this kind of info... |
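To make the failure concrete, here is a minimal sketch in Python 3 (the session above is Python 2) that splits the raw tag object shown above into header fields while everything is still bytes, then reports which field is not valid UTF-8. `parse_tag_headers` and `undecodable_fields` are hypothetical helpers written for illustration, not part of GitPython, and the email placeholder is copied exactly as it appears above.

```python
# Raw tag object as shown in the session above, kept as bytes.
RAW_TAG = (
    b"object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\n"
    b"type commit\n"
    b"tag PROMOTED_1501131729_MKT15_01_12_QU_1\n"
    b"tagger \xa8John Doe <[email protected]> 1421167136 +0100\n"
    b"\n"
    b"Merged CIU_MKT1501_28 to remote master\n"
)


def parse_tag_headers(raw):
    """Hypothetical helper: split a raw tag object into (headers, message), all bytes."""
    # Headers and message are separated by the first blank line.
    header_block, _, message = raw.partition(b"\n\n")
    headers = {}
    for line in header_block.split(b"\n"):
        key, _, value = line.partition(b" ")
        headers[key] = value
    return headers, message


def undecodable_fields(headers, encoding="utf-8"):
    """Return the header names whose values fail a strict decode."""
    bad = []
    for key, value in headers.items():
        try:
            value.decode(encoding)
        except UnicodeDecodeError:
            bad.append(key)
    return bad


headers, message = parse_tag_headers(RAW_TAG)
bad = undecodable_fields(headers)
print(bad)
print(headers[b"object"].decode("ascii"))
```

Only the free-form tagger field fails here; the object id, type, and tag name are plain ASCII, so a bytes-first parser could still resolve the tag without decoding the name at all.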
Using the Even if parsing is made to work at some point, right now the tagger-name are expected to be |
What about using |
Great idea! |
But that would be against your policy of handling as much as possible as |
BTW, that fixed my particular problem, but I guess you don't want the PR just yet... |
But how would you produce proper unicode strings if the encoding is unclear? It's unsafe to try, as this example shows. Maybe a suitable solution would be to allow the client to set the decode behaviour on a per-repository basis to control whether Doing this sounds like quite some work, and as it stands, the unicode handling in GitPython seems flawed by design :(. |
I think git just doesn't handle encoding at all. In any case, any free-form byte sequences (strings) are strings for user consumption: tag names, log comments, etc. Even filenames are, I'm sure, not converted in any way. In fact, most (Unix/Linux) filesystems know nothing about encoding: it's possible to handle filenames encoded in one encoding on a system using another encoding, simply because filenames are treated as byte sequences with no specific meaning or encoding. |
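The "byte sequences with no specific meaning" point can be illustrated in Python 3, which deals with undecodable filename bytes through the `surrogateescape` error handler: such bytes are mapped to lone surrogates on decode and restored exactly on encode. This is a minimal sketch of that round trip using only the standard library; the filename is made up, and whether GitPython should adopt this approach is not something the thread settles.

```python
# A made-up filename containing a byte (0xa8) that is not valid UTF-8.
raw_name = b"report_\xa8draft.txt"

# Strict decoding fails, as in the errors reported in this thread.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate (U+DC80..U+DCFF)...
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# ...and encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok)
print(round_tripped == raw_name)
```

The resulting text cannot be printed or encoded strictly, so this only helps when the goal is to carry the value through unchanged rather than to display it.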
I have encountered a similar problem: when invoking diff on a file that contains an invalid UTF-8 sequence in this locale, GitPython fails with UnicodeDecodeError. Backtrace follows:
File "/usr/lib/python2.7/site-packages/gitupstream/gitupstream.py", line 175, in update
Will this issue cover my error, or do I need to create another one? Maybe you could help me with a solution? |
@StyXman You are totally right. As stated previously, fixing this in GitPython may be a breaking change for some, as bytes would be returned instead of unicode. This makes me somewhat reluctant to attempt such a change, but I should check how much is actually affected. @CepGamer You can pass the |
I believe I ran into a similar issue. When querying the commit message for a commit, the following exception is thrown:
ERROR:git.objects.commit:Failed to decode message '...' using encoding UTF-8
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/git/objects/commit.py", line 500, in _deserialize
    self.message = self.message.decode(self.encoding)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 126: invalid start byte
Unfortunately, I cannot share the exact commit message or the repository. I did not succeed in reproducing it in a test repository. |
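One way to keep a single bad byte from aborting deserialization is to fall back to a lossy decode when the declared encoding fails. The following is a sketch under that assumption, not GitPython's actual behaviour: `decode_message` is a hypothetical helper, and the sample bytes are invented, since the real commit message was not shared.

```python
import logging

log = logging.getLogger("example.decode")


def decode_message(raw, encoding="UTF-8"):
    """Hypothetical helper: decode strictly, else substitute U+FFFD for bad bytes.

    Returns (text, lossless) so the caller can tell whether data was altered.
    """
    try:
        return raw.decode(encoding), True
    except UnicodeDecodeError as exc:
        log.warning("Failed to decode message using encoding %s: %s", encoding, exc)
        return raw.decode(encoding, errors="replace"), False


# Invented sample: 0xf8 is not a valid UTF-8 start byte.
sample = b"Fix typo in S\xf8ren's patch"

text, lossless = decode_message(sample)
print(lossless)
print(text)
```

Returning the `lossless` flag alongside the text is a deliberate choice in this sketch: a caller that needs the exact original would keep the raw bytes instead of relying on the substituted string.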
A new release was just made to pypi |
The |
I have run into this problem. This script, which tries to loop through the tags of the nodejs/node repository, exposes this bug: https://gist.github.com/sbenthall/14c4d14c00876440ba6d0ae62efa432f Using version 2.1.11 |
I have the same issue when reading the branches property. How can I solve it? |
I'm not sure this is a proper way to use TagReferences, but it's definitely unexpected. This time I'm using GitPython installed from PyPI.
I have this nice tag:
I can get a lot of info out of it:
But this fails:
Unluckily this is happening with an internal repo and I don't know how to even try to reproduce it with a public one. Meanwhile I can work around it by using tag.object.hexsha, which is what I wanted.