Change character units from UTF-16 code unit to Unicode codepoint #376
Comments
|
I would suggest going even one step further. Why should editors and servers have to know which bytes form a Unicode codepoint? Right now the specification states it supports only UTF-8 encoding, but with the Content-Type header I guess there is an idea of supporting other encodings in the future too. In that case it would be even better to use a number of bytes instead of UTF-16 code units or Unicode codepoints.
|
@MaskRay we need to distinguish between the encoding used to transfer the JSON-RPC message and the encoding used for column offsets. The column offset in a document assumes that, after the JSON-RPC message has been decoded and the string parsed, the document content is stored in UTF-16 encoding. We chose UTF-16 here since most languages store strings in memory in UTF-16, not UTF-8. To save one encoding pass we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support. If we want to support UTF-8 for the internal text document representation and line offsets, this would be a breaking change or would need to be a capability the client announces.

Regarding byte offsets: there was another discussion about whether the protocol should be offset based. However, the protocol was designed to support tools and their UI; for example, a reference match in a file could not be rendered in a list using byte offsets. The client would need to read the content of the file and convert the offset into line/column. We decided to let the server do this, since the server has very likely read the file before anyway.
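To illustrate how the candidate units diverge, here is a small example (the string is made up; standard ECMAScript string APIs):

```ts
// U+10400 (𐐀) is 1 Unicode codepoint, 2 UTF-16 code units, and 4 UTF-8 bytes.
const line = "a\u{10400}b";

console.log(line.indexOf("b"));      // 3 -- column of "b" in UTF-16 code units
console.log([...line].indexOf("b")); // 2 -- column of "b" in Unicode codepoints
console.log(new TextEncoder().encode(line).indexOf(0x62)); // 5 -- column in UTF-8 bytes
```
|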
Source? Isn't the only reason for this that Java/JavaScript/C# use UTF-16 as their string representation? I'd say there is a good case to be made that (in hindsight) UTF-16 was a poor choice of string type in those languages as well, which makes it dubious to optimize for that case. The source code itself is usually UTF-8 (or just ASCII), and as has been said, this is also the case when transferring over JSON-RPC, so I'd say the case is pretty strong for assuming UTF-8 instead of UTF-16.
|
Citation needed? ;) Of the 7 downstream language completers we support in ycmd:
* full disclosure, I think these use code points, else we have a bug! The last is a bit of a fib, because we're integrating the Language Server API for Java. However, as we receive byte offsets from the client and internally use Unicode code points, we have to re-encode the file as UTF-16, do a bunch of hackery to count the code units (roughly the shape of the sketch below), then send the file, encoded as UTF-8, over to the language server, with offsets in UTF-16 code units. Of the client implementations of ycmd (there are about 8, I think), all of them are able to provide line-byte offsets. I don't know for certain about all of them, but certainly the main one (Vim) is not able to provide UTF-16 code units; they would have to be calculated. Anyway, the point is that it might not be as simple as originally thought :D Though I appreciate that a specification is such, and changing it would be breaking. Just my 2p
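A hypothetical TypeScript sketch of that re-counting (illustrative only, not ycmd's actual implementation):

```ts
// Convert a 0-based UTF-8 byte column within a line into the UTF-16
// code-unit column that LSP expects.
function byteColToUtf16Col(line: string, byteCol: number): number {
  let bytes = 0;
  let utf16 = 0;
  for (const ch of line) {                        // iterates by codepoint
    if (bytes >= byteCol) break;
    bytes += new TextEncoder().encode(ch).length; // 1-4 UTF-8 bytes
    utf16 += ch.length;                           // 1 or 2 UTF-16 code units
  }
  return utf16;
}

console.log(byteColToUtf16Col("a\u{10400}b", 5)); // 3
```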
|
Not that SO is particularly reliable, but it happens to support my point, so I'm shamelessly going to quote from: https://stackoverflow.com/questions/30775689/python-length-of-unicode-string-confusion
|
|
Emacs uses some extended UTF-8 and its functions return numbers in units of Unicode codepoints. https://github.com/emacs-lsp/lsp-mode/blob/master/lsp-methods.el#L657 cc @vibhavp for the Emacs lsp-mode internal representation
|
I am sorry in advance if I am saying something stupid right now. I have a question for you guys. My thought process is: if there is a file in an encoding other than any UTF, and we use an encoding other than UTF in JSON-RPC (which can happen in the future), then why would there be any need for the client and server to know what Unicode is at all?
That's it. It is easy to provide line-byte offsets. So why would it be better to use Unicode codepoints instead of bytes? Let's say, for example, we have a file encoded in ISO-8859-1 and we use the same encoding for JSON-RPC communication. There is a character ä (0xE4) that can be represented in at least two ways in Unicode: U+00E4 (ä) or U+0061 (a) U+0308 (¨, combining diaeresis). The former is one Unicode codepoint, the latter is two, and both are equally good and correct. If the client uses one and the server the other, we have a problem. Simply using line-byte offsets here, we would avoid these problems. @dbaeumer I think we misunderstood each other, or at least I did. I didn't mean to use a byte offset from the beginning of the file, which would require the client to convert it, but to still use a {line, column} pair, just counting the column in bytes instead of UTF-16 code units or Unicode codepoints.
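For illustration, the two representations of ä compare like this (standard ECMAScript string APIs):

```ts
const precomposed = "\u00E4"; // ä as one codepoint (U+00E4)
const decomposed = "a\u0308"; // a + combining diaeresis (U+0061 U+0308)

console.log(precomposed === decomposed);                      // false
console.log([...precomposed].length, [...decomposed].length); // 1 2
console.log(precomposed.normalize("NFD") === decomposed);     // true
```
|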
Are you serious? UTF-16 is one of the worst choices of the old days, made for lack of alternative solutions. Now we have UTF-8, and to choose UTF-16 you need a really good reason rather than a very brave assumption about the implementation details of every piece of software in the world, especially if we consider future software. This assumption is very true on Microsoft platforms, which will never consider UTF-8. I think some bias toward Microsoft is unavoidable as the leadership of this project is from Microsoft, but this is too much. This reminds me of the Embrace, extend, and extinguish strategy. If this is the case, it is reason enough for me to boycott LSP, because we are going to see this kind of Microsoft-ish nonsense decision making forever.
|
Just to be clear, I don't work for Microsoft, and generally haven't been a big fan of them (being a Linux user myself). But I feel compelled to defend the LSP / VS Code team here. I really don't think there's a big conspiracy here. From where I stand, it looks to me like the VS Code and LSP teams are doing their very best to be inclusive and open. The UTF-8 vs UTF-16 choice may seem like a big and important point to some, but to others, including myself, the choice probably seems somewhat arbitrary. For decisions like these, it is natural to write into the spec something that conforms to your current prototype implementation, and I think this is perfectly reasonable. Some may think that is a mistake. As this is an open spec and subject to change/revision/discussion, everyone is free to voice their opinion and argue about which choice is right and whether it should be changed... but I think such discussions should stick to technical arguments; there's no need to resort to insinuations of a Microsoft conspiracy (moreover, these insinuations are really unwarranted here, in my opinion).
|
I apologize for involving my political views in my comment. I was over-sensitive due to traumatic memories of Microsoft in the old days. Now I see this spec is in progress and subject to change.
I think this is fine: an optional field which designates the encoding mode of the indices alongside the index numbers. If the encoding mode is set to
|
This is causing us some implementation difficulty in clangd, which needs to interop with external indexes.
|
Yup, same problem here working on reproto/reproto#34. This would be straightforward if "Line/character can be measured in units of Unicode codepoints" as stated in the original description.
|
As mentioned in one of my first comments, this needs to be backwards compatible if introduced. An idea would be:
If no common encoding can be found, the server will not function with the client. So in the end such a change will force clients to support the union of commonly used encodings. Given this, I am actually not sure the LSP server ecosystem will profit from such a change (a server using an encoding not widely adopted by clients is of limited use from an ecosystem perspective). On the other hand, we only have a limited number of clients compared to a large number of servers, so it might not be too difficult to do the adoption on the client side. I would appreciate a PR for this that, for example, does the following:
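A minimal sketch of what such a negotiation could look like (the field and type names here are invented for illustration and are not part of the spec):

```ts
// The client advertises, in order of preference, the units it can count
// positions in; the server picks one it supports or falls back to the
// UTF-16 default, which keeps old clients and servers working unchanged.
interface GeneralClientCapabilitiesSketch {
  positionEncodings?: ("utf-8" | "utf-16" | "utf-32")[];
}

interface ServerCapabilitiesSketch {
  // Must be one of the encodings offered by the client; omitted means
  // the backwards-compatible UTF-16 default.
  positionEncoding?: "utf-8" | "utf-16" | "utf-32";
}
```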
|
|
What about using byte indices directly? Using codepoints still requires going through every single character.
|
@jclc using byte indices is not a bad idea, but I want to outline the implications of such a choice: either servers or clients need to communicate which encoding ranges are sent in, and one of them needs to adapt to the other's requirements. Since clients are less numerous, it would seem the more economical choice for this responsibility to fall on them.
This depends a bit on the language, but rows are generally unambiguous. They can be stored in such a way that we don't have to decode all characters up until that row (e.g. when using a specialized rope structure). With this approach we only have to decode the content of the addressed rows; see the sketch below. Some transcoding work will happen unless the internal encodings of server and client match. Edit: The reason I prefer codepoints over bytes is that they are inherently unambiguous. All languages dealing with Unicode must have ways of traversing strings and relating the number of codepoints to indexes, regardless of which specific encodings are well supported.
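A sketch of that row-based lookup (hypothetical helper; a real implementation would keep the per-line byte offsets up to date in the rope):

```ts
// Resolve an LSP-style { line, character } position, with character counted
// in Unicode codepoints, to a byte offset in a UTF-8 buffer, decoding only
// the addressed row.
function positionToByteOffset(
  text: Uint8Array,
  lineStartBytes: number[], // byte offset at which each line begins
  line: number,
  character: number,
): number {
  const start = lineStartBytes[line];
  const end = lineStartBytes[line + 1] ?? text.length;
  const row = new TextDecoder().decode(text.subarray(start, end));

  let offset = start;
  let seen = 0;
  for (const ch of row) {                          // codepoint iteration
    if (seen === character) break;
    offset += new TextEncoder().encode(ch).length; // UTF-8 width of ch
    seen++;
  }
  return offset;
}
```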
|
I think all of these problems arise from the lack of a precise definition of "character" in the LSP spec. The term "character" is used everywhere in the spec, but it is not actually well-defined anywhere.
In my opinion, the first thing we have to do is define the term "character" precisely, or replace the term "character" with something else. The lack of a precise definition of "character" increases ambiguity and potential bugs. As far as I know, Unicode defines three concepts of text assembly worth considering here: Code Units, Code Points, and Grapheme Clusters.
The closest concept to a human's perceived "character" is the Grapheme Cluster, as it counts glyphs rather than code. As @udoprog pointed out, the transcoding cost is negligible, so accept the cost and choose the logically ideal one: Grapheme Cluster counting. This is better than Code Points and less ambiguous, in my opinion. Furthermore, Grapheme Cluster counts are very likely already tracked by code editors to provide precise line/column (or character offset) information to end users, so tracking them wouldn't be a problem. There would be two distinct position/offset counting modes: (1)
In LSP3, servers should support both of these. If Grapheme Cluster counting is unacceptable, UTF-8 Code Unit counting (== encoded byte count) can be considered instead. The character offset becomes an irregular indexing number, but it will be consistent with the other parts of the spec.
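For what it's worth, the three counts can diverge dramatically for a single user-perceived character (ECMAScript example; `Intl.Segmenter` is only available in newer runtimes):

```ts
const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}"; // 👩‍👩‍👧

console.log(family.length);      // 8 UTF-16 code units
console.log([...family].length); // 5 Unicode codepoints

const graphemes = new Intl.Segmenter(undefined, { granularity: "grapheme" });
console.log([...graphemes.segment(family)].length); // 1 grapheme cluster
```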
|
@eonil Regarding grapheme clusters: the exact composition of clusters is permitted to vary across (human) languages and locales (tailored grapheme clusters). They naturally vary from one revision of the Unicode spec to another as new clusters are added. Finally, iterating over grapheme clusters is not commonly found in standard libraries, in my experience.
|
@udoprog I see. If grapheme clusters are unstable across implementations, I think they should not be used.
|
For pity's sake @jayadevanraja, do not waste time arguing that UTF-8 is anything less than the best standard we have for text in technology today, given its ubiquity, overall performance, simplicity, handling of variable-width characters no worse than UTF-16's, and its ever-growing widespread usage. Even if another encoding were "better", that's completely off-point: it isn't worth maintaining an inconsistent UTF-16 character unit for an LSP field when the text is already transmitted in UTF-8.
|
I think fixing that mistake and specifying UTF-8 would be a perfect addition to the new LSP standard. Sadly, it seems most implementors of language servers don't care.
|
It's not really wrong. Sure, Chinese is more optimal than, say, Japanese; however, according to https://hsivonen.fi/string-length/ the Universal Declaration of Human Rights in Japanese encoded in UTF-8 is only 19% larger than the Universal Declaration of Human Rights in English (10227 bytes vs 8565 bytes). If Japanese really cost three bytes for every one byte of English, we would expect around 8565 × 3 = 25695 bytes, far more than the actual 10227 bytes, so most of the supposed 1-vs-3 overhead cancels out. It uses fewer bytes than Dutch, for example, at the very least. Additionally, it's worth noting that Twitter has a 140-character limit for Japanese text but 280 for English text. This is because Japanese characters convey much more information than English characters; see https://blog.twitter.com/en_us/topics/product/2017/Giving-you-more-characters-to-express-yourself.html.
Programs typically end up containing a lot of Latin characters. Keywords, operators, and standard library function names are all written in Latin script. Even if you write all your variable names, function names, and comments in, say, Korean, there will still be a lot of symbols using Latin script. Note that the Language Server Protocol is designed for programming languages, not regular documents.
|
@xfix
|
My points are these (none of which have been refuted). By UTF-16, I mean compact UTF-16 (as mentioned in JEP 254)
TL;DR: Compact UTF-16 is more compact and has a larger code space. What is the reason some people want to use UTF-8 instead of compact UTF-16 (as mentioned in JEP 254)?
|
Compact UTF-16 is a system-internal encoding, designed never to leave the system it is used in, which makes it a bad choice for a standard designed to work across programs like the Language Server Protocol. In a way it's comparable to other system-internal encodings like CESU-8, MUTF-8, WTF-8, PEP 383 and Perl 5's. Also:
If you directly put unpaired surrogates into your source files, that's your problem. That is not valid UTF-16, and most text editors won't handle it. If you need an unpaired surrogate in a string, write
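To illustrate why unpaired surrogates cannot survive a round trip through UTF-8 (standard ECMAScript behaviour):

```ts
const lone = "\uD800"; // an unpaired high surrogate, representable in a JS string

console.log(lone.length);                    // 1
console.log(new TextEncoder().encode(lone)); // Uint8Array [239, 191, 189] -- replaced by U+FFFD
console.log(JSON.stringify(lone));           // '"\ud800"' -- escaped, survives JSON
```
|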
Is this literally a shitpost thread now? :D
|
If we compare improper UTF-16, we should compare it with improper UTF-8. Improper UTF-8 can represent every integer between 0 and
|
@jayadevanraja Please stop it.
|
I think that this sort of profanity should have been avoided. You did that after quoting a sentence out of context. I have gone through this thread, and I find that most of his points have not been answered. Referring to all Asian languages as Chinese and Japanese is also not in the right spirit.
|
@Kalyani-Varma As far as I know, this term is the standard terminology for the phenomenon in all circles (including FOSS circles) in which it occurs. But if you have a concrete better alternative, I am always happy to learn. I don't think the context in which this sentence occurs (which is not hard to find w.r.t. my comment) explains why it is relevant to the topic at hand, which is offset measurement (not compactness of encoding).
|
@Kalyani-Varma
|
The (I mentioned this extension above, but it may have gotten lost in the contentious discussion :-)
|
Since Microsoft was historically one of the major ecosystems using UTF-16, it's perhaps worth mentioning that Microsoft now recommends UTF-8 for UWP applications and is making it possible to use native Windows APIs with UTF-8: https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page I hesitate to post this because
|
Where can I find the documentation for this? I am looking at the LSP 3.17 specification and it only mentions UTF-16. |
|
It is currently only part of the LSIF specification (see
|
rust-analyzer now implements the
|
Does VSCode not support UTF-8 at all?
|
@ViliamVadocz I think so; it's now Microsoft's job to support this and (preferably) also deprecate/migrate away from their own proprietary encoding.
|
Are there plans to generalize the
|
Text document offsets are based on a UTF-16 string representation. This is strange, given that text contents are transmitted in UTF-8.
Here in `TextDocumentContentChangeEvent`, `range` is specified in UTF-16 column offsets while `text` is transmitted in UTF-8.

Is it more reasonable to unify these, remove UTF-16 from the wording, and use UTF-8 as the sole encoding? Line/character can be measured in units of Unicode codepoints instead of UTF-16 code units.
A line cannot be too long, so the extra computation needed to find the N'th Unicode codepoint would not place much of a burden on editors and language servers.
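For example, finding the UTF-16 index of the N'th codepoint is a single linear scan over the line (hypothetical helper):

```ts
function codepointColToUtf16Index(line: string, n: number): number {
  let i = 0;
  for (let seen = 0; seen < n && i < line.length; seen++) {
    // Step over one codepoint: astral characters occupy 2 code units.
    i += line.codePointAt(i)! > 0xffff ? 2 : 1;
  }
  return i;
}

console.log(codepointColToUtf16Index("a\u{10400}b", 2)); // 3
```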
jacobdufault/cquery#57
Survey: counting method of Position.character offsets supported by language servers/clients
https://docs.google.com/spreadsheets/d/168jSz68po0R09lO0xFK4OmDsQukLzSPCXqB6-728PXQ/edit#gid=0