
Change character units from UTF-16 code unit to Unicode codepoint #376

Open
MaskRay opened this issue Jan 13, 2018 · 143 comments

@MaskRay commented Jan 13, 2018

Text document offsets are based on a UTF-16 string representation. This is strange, given that the text contents themselves are transmitted in UTF-8.

Text Documents
… The offsets are based on a UTF-16 string representation.

Here in TextDocumentContentChangeEvent, range is specified in UTF-16 column offsets while text is transmitted in UTF-8.

interface TextDocumentContentChangeEvent {
	range?: Range;
	rangeLength?: number;
	text: string;
}

Would it be more reasonable to unify these, remove UTF-16 from the wording, and use UTF-8 as the sole encoding? Line/character could be measured in units of Unicode codepoints instead of UTF-16 code units.
A line cannot be too long, so the extra computation to find the N-th Unicode codepoint would not place much of a burden on editors and language servers.

jacobdufault/cquery#57

Survey: counting method of Position.character offsets supported by language servers/clients
https://docs.google.com/spreadsheets/d/168jSz68po0R09lO0xFK4OmDsQukLzSPCXqB6-728PXQ/edit#gid=0

MaskRay changed the title from "Change character offset calculation from UTF-16 to UTF-8" to "Change character units from UTF-16 code unit to Unicode codepoint" on Jan 13, 2018
@szatanjl commented Jan 18, 2018

I would suggest going even one step further. Why should editors and servers need to know which bytes form a Unicode codepoint? Right now the specification states that it supports only UTF-8 encoding, but with the Content-Type header I guess there is an idea of supporting other encodings in the future too. It would then be even better to use a number of bytes instead of UTF-16 code units or Unicode codepoints.

@dbaeumer (Member) commented Jan 18, 2018

@MaskRay we need to distinguish this from the encoding used to transfer the JSON-RPC message. We currently use UTF-8 there, but as the header indicates this can be changed to any encoding, assuming the encoding is supported by all libraries (node, for example, supports only a limited set of encodings by default).

The column offset in a document assumes that, after the JSON-RPC message has been decoded, the parsed document content is stored in UTF-16. We chose UTF-16 here since most languages store strings in memory as UTF-16, not UTF-8. To save one encoding pass we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support.

If we want to support UTF-8 for the internal text document representation and line offsets, this would be a breaking change or would need to be a capability the client announces.

Regarding byte offsets: there was another discussion about whether the protocol should be offset based. However, the protocol was designed to support tools and their UIs; a reference match in a file, for example, could not be rendered in a list using byte offsets. The client would need to read the content of the file and convert the offset to line/column. We decided to let the server do this, since the server has very likely read the file before anyway.

@Marwes (Contributor) commented Jan 18, 2018

We chose UTF-16 here since most languages store strings in memory as UTF-16, not UTF-8.

Source? Isn't the only reason for this that Java/JavaScript/C# use UTF-16 as their string representation? I'd say there is a good case to be made that (in hindsight) UTF-16 was a poor choice of string type for those languages as well, which makes it dubious to optimize for that case. Source code itself is usually UTF-8 (or plain ASCII), and, as has been said, this is also the case when transferring over JSON-RPC, so I'd say the case is pretty strong for assuming UTF-8 instead of UTF-16.

@puremourning commented Jan 18, 2018

We chose UTF-16 here since most languages store strings in memory as UTF-16, not UTF-8. To save one encoding pass we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support.

Citation needed? ;)

Of the 7 downstream language completers we support in ycmd:

  • 1 uses byte offsets (libclang)
  • 6 use unicode code points (gocode, tern, tsserver*, jedi, racer, omnisharp*)
  • 0 use utf 16 code units

* full disclosure, I think these use code points, else we have a bug!

The last one is a bit of a fib, because we're integrating the Language Server API for Java.

However, as we receive byte offsets from the client and internally use Unicode codepoints, we have to re-encode the file as UTF-16, do a bunch of hackery to count the code units, then send the file, encoded as UTF-8, over to the language server, with offsets in UTF-16 code units.

Of the client implementations of ycmd (there are about 8, I think), all are able to provide line-byte offsets. I don't know for certain about all of them, but the main one (Vim) is certainly not able to provide UTF-16 code units; they would have to be calculated.

Anyway, the point is that it might not be as simple as originally thought :D Though I appreciate that a specification is what it is, and changing it would be breaking. Just my 2p.
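The hackery described above boils down to a conversion like the following (a TypeScript sketch of the idea, not ycmd's actual code; the function name is hypothetical):

```typescript
// Convert a codepoint count within a line into the UTF-16 code-unit
// offset that LSP expects. for..of iterates codepoint by codepoint.
function codepointsToUtf16Units(line: string, codepoints: number): number {
  let units = 0;
  let remaining = codepoints;
  for (const ch of line) {
    if (remaining === 0) break;
    remaining -= 1;
    units += ch.length; // 1 for BMP characters, 2 for surrogate pairs
  }
  return units;
}

// "𐐀" occupies two UTF-16 code units, so codepoint offset 1 maps to unit 2.
console.log(codepointsToUtf16Units("𐐀x", 1)); // 2
```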

@puremourning commented Jan 19, 2018

Not that SO is particularly reliable, but it happens to support my point, so I'm shamelessly going to quote from: https://stackoverflow.com/questions/30775689/python-length-of-unicode-string-confusion

You have 5 codepoints. One of those codepoints is outside of the Basic Multilingual Plane which means the UTF-16 encoding for those codepoints has to use two code units for the character.

In other words, the client is relying on an implementation detail and is doing something wrong. They should be counting codepoints, not code units. There are several platforms where this happens quite regularly; Python 2 UCS-2 builds are one such, but Java developers often forget about the difference, as do Windows APIs.

@MaskRay (Author) commented Jan 19, 2018

Emacs uses an extended UTF-8 internally, and its functions return numbers in units of Unicode codepoints.

https://github.com/emacs-lsp/lsp-mode/blob/master/lsp-methods.el#L657

@vibhavp for Emacs lsp-mode internal representation

@szatanjl commented Jan 19, 2018

I am sorry in advance if I am saying something stupid right now. I have a question for you guys.

My thought process is this: if a file is in an encoding other than any UTF, and we use an encoding other than UTF in JSON-RPC (which could happen in the future), then why would the client and server need to know anything about Unicode at all?

Of the client implementations of ycmd (there are about 8 I think), all of them are able to provide line-byte offsets.

That's it. It is easy to provide line-byte offsets. So why would it be better to use Unicode codepoints instead of bytes?

Let's say, for example, we have a file encoded in ISO-8859-1 and we use the same encoding for the JSON-RPC communication. The character ä (0xE4) can be represented in at least two ways in Unicode: as U+00E4 (ä) or as U+0061 (a) followed by U+0308 (combining diaeresis). The former is one Unicode codepoint, the latter is two, and both are equally valid and correct. If the client uses one and the server the other, we have a problem. Simply using line-byte offsets here, we would avoid these problems.

@dbaeumer I think we misunderstood each other, or at least I did. I didn't mean a byte offset from the beginning of the file, which would require the client to convert it, but still a {line, column} pair, with the column counted in bytes instead of UTF-16 code units or Unicode codepoints.

@eonil commented Jan 22, 2018

We chose UTF-16 here since most languages store strings in memory as UTF-16, not UTF-8.

If we want to support UTF-8 for the internal text document representation and line offsets, this would be a breaking change or would need to be a capability the client announces.

Are you serious? UTF-16 was one of the worst choices of the old days, made for lack of alternatives. Now we have UTF-8, and to choose UTF-16 you need a really good reason, not a very brave assumption about the implementation details of every piece of software in the world, especially if we consider future software.

This assumption holds on Microsoft platforms, which will never consider UTF-8. I think some bias toward Microsoft is unavoidable, as the leadership of this project is from Microsoft, but this is too much. It reminds me of the Embrace, extend, and extinguish strategy. If this is the case, it is reason enough for me to boycott LSP, because we are going to see this kind of Microsoft-ish nonsense decision-making forever.

@kdvolder commented Jan 22, 2018

Just to be clear, I don't work for Microsoft, and generally haven't been a big fan of them (being a Linux user myself). But I feel compelled to defend the LSP / vscode team here. I really don't think there's a big conspiracy theory here. From where I stand, it looks to me like Vscode and LSP teams are doing their very best to be inclusive and open.

The UTF-8 vs UTF-16 choice may seem like a big and important point to some, but to others, including myself, the choice seems somewhat arbitrary. For decisions like these, it is natural to write into the spec whatever conforms to your current prototype implementation, and I think that is perfectly reasonable.

Some may think that was a mistake. As this is an open spec, subject to change, revision, and discussion, everyone is free to voice their opinion and argue about which choice is right and whether it should be changed... but I think such discussions should stick to technical arguments. There is no need to resort to insinuations of a Microsoft conspiracy (and those insinuations are really unwarranted here, in my opinion).

@eonil commented Jan 23, 2018

I apologize for involving my political views in my comment. I was over-sensitive due to traumatic memories of Microsoft in the old days. I now see that this spec is in progress and subject to change.

I didn't mention technical reasons because they are mainly repetitions of other people's opinions, or well known. Anyway, I list my technical reasons here.

  • IMO, UTF-8 is the present and future, and UTF-16 is a legacy to avoid. The reason is here.
  • By requiring a dependency on UTF-16, LSP effectively forces implementations to involve that legacy.
  • Simplicity beats extra complexity and dependencies. One encoding everywhere is better.
  • More complexity and dependencies greatly increase the implementation work.
  • AFAIK, converting indices between different Unicode encodings is expensive.
  • LSP is a new protocol. There is no reason to involve a bad legacy. The only benefit here is a potential benefit to specific platforms with native UTF-16 strings.
  • For now, the only reason to require UTF-16 is to give that benefit to specific implementations.
  • Other platforms won't be very happy, due to the increased complexity and potential performance penalty in implementation.
  • Such an unfair benefit is likely to split the community.

... or needs to be a capability the client announces

I think this is fine: an optional field that designates the encoding mode of indices, sent alongside the index numbers. If the encoding mode is set to utf-8, interpret the numbers as UTF-8 code units; if it is utf-16, interpret them as UTF-16 code units. If the field is missing, fall back to UTF-16 for legacy compatibility.

@sam-mccall (Contributor) commented Apr 13, 2018

This is causing us some implementation difficulty in clangd, which needs to interop with external indexes.
Using UTF-16 on network protocols is rare, so requiring indexes to provide UTF-16 column numbers is a major yak-shave and breaks abstractions.

@udoprog commented Apr 13, 2018

Yup, same problem here working on reproto/reproto#34.

This would be straightforward if "Line/character can be measured in units of Unicode codepoints", as stated in the original description.

@dbaeumer (Member) commented Apr 27, 2018

As mentioned in one of my first comments, this needs to be backwards compatible if introduced. An idea would be:

  • the client announces the position encodings it supports.
  • the server picks an encoding to use.

If no common encoding can be found, the server will not function with that client. So in the end such a change will force clients to support the union of commonly used encodings. Given this, I am actually not sure the LSP server ecosystem would profit from such a change (a server using an encoding not widely adopted by clients is of limited use from an ecosystem perspective). On the other hand, we have only a limited number of clients compared to a large number of servers, so it might not be too difficult for the clients to do the adoption.
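The negotiation described above could look something like this (a sketch; the type and function names here are hypothetical, although clangd's later extension used the name offsetEncoding for the capability):

```typescript
type PositionEncoding = "utf-8" | "utf-16" | "utf-32";

// Server-side choice: take the first server-preferred encoding that the
// client also supports, falling back to UTF-16 for backwards compatibility.
function pickEncoding(
  clientSupports: PositionEncoding[],
  serverPrefers: PositionEncoding[],
): PositionEncoding {
  for (const enc of serverPrefers) {
    if (clientSupports.includes(enc)) return enc;
  }
  return "utf-16";
}

console.log(pickEncoding(["utf-8", "utf-16"], ["utf-8"])); // "utf-8"
console.log(pickEncoding(["utf-16"], ["utf-8"]));          // "utf-16"
```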

I would appreciate a PR for this that for example does the following:

@jclc commented May 13, 2018

What about using byte indices directly? Using codepoints still requires going through every single character.

@udoprog commented May 13, 2018

@jclc using byte indices is not a bad idea, but I want to outline the implications of such a choice:

Either servers or clients need to communicate which encoding ranges are sent in, and one of them needs to adapt to the other's requirements. Since clients are less numerous, it would seem the more economical choice for this responsibility to fall on them.
To be backwards compatible, the exchange has to be bi-directional. All servers have to support UTF-16 and fall back to it when the client indicates that this is its only capability, at least until a new major revision of the LSP has been widely adopted and the old one deprecated.

Using codepoints still requires going through every single character.

This depends a bit on the language, but rows are generally unambiguous. They can be stored in such a way that we don't have to decode all characters up to that row (e.g. when using a specialized rope structure). With this approach we only have to decode the content of the addressed rows. Some transcoding work will happen unless the internal encodings of the server and client match.

Edit: The reason I prefer codepoints over bytes is that they are inherently unambiguous. Every language dealing with Unicode must have a way to traverse strings and relate codepoint counts to indexes, regardless of which specific encodings are well supported.

@eonil commented May 14, 2018

I think all of these problems arise from the lack of a precise definition of "character" in the LSP spec. The term "character" is used everywhere in the spec, but it is never actually well defined on its own.

In any case, the LSP3 spec defines "character offset" in terms of UTF-16 code units, which means it implicitly defines "character" as a UTF-16 code unit as well. This is (1) nonsense, as a UTF-16 code unit is not intended to be a character, and (2) inconsistent with the other parts of the spec, which are UTF-8 based.

In my opinion, the first thing we have to do is define the term "character" precisely, or replace it with something else. The lack of a precise definition increases ambiguity and the potential for bugs.


As far as I know, Unicode defines three relevant units of text assembly:

  • Code unit
  • Code point
  • Grapheme cluster

And the concept closest to a human's perceived "character" is the grapheme cluster, as it counts glyphs rather than code values.

As @udoprog pointed out, the transcoding cost is negligible, so accept the cost and choose the logically ideal unit: grapheme-cluster counting. This is better than code points and, in my opinion, less ambiguous.

Furthermore, code editors very likely already track grapheme-cluster counts in order to present precise line/column (or character offset) information to end users, so tracking them wouldn't be a problem.

There would be two distinct position/offset counting modes: (1) legacy-compatible and (2) grapheme-cluster counting.

  • The legacy-compatible mode is the same as today: it defines "character" as a UTF-16 code unit.
  • The grapheme-cluster-counting mode defines "character" as a grapheme cluster and uses the grapheme-cluster count as the position offset.

In LSP3, servers should support both the legacy-compatible (deprecated, but the default) and grapheme-cluster-counting modes.
In LSP4, grapheme-cluster counting would be the only counting method.


If grapheme-cluster counting is unacceptable, UTF-8 code units (i.e. encoded byte counts) could be used instead. The character offset becomes an irregular index, but it would at least be consistent with the rest of the spec.

@udoprog commented May 14, 2018

@eonil Regarding grapheme clusters:

The exact composition of clusters is permitted to vary across (human) languages and locales (tailored grapheme clusters). They also naturally vary from one revision of the Unicode spec to the next as new clusters are added. Finally, in my experience, iterating over grapheme clusters is not commonly available in standard libraries.

@eonil commented May 14, 2018

@udoprog I see. If grapheme clusters are unstable across implementations, I agree they should not be used.

@jeffvandyke commented Oct 27, 2020

For pity's sake @jayadevanraja, do not waste time arguing that UTF-8 is anything less than the best standard we have for text in technology today, given its ubiquity, overall performance, simplicity, variable-width characters no worse than UTF-16's, and its ever-growing widespread usage. Even if another encoding were "better", that is completely off-point, and it isn't worth maintaining an inconsistent UTF-16 character unit for an LSP field when the text is already transmitted in UTF-8.

@XVilka commented Oct 27, 2020

I think fixing this mistake and specifying the use of UTF-8 would be a perfect addition to the new LSP standard. Sadly, it seems most implementors of language servers don't care.

@xfix (Contributor) commented Oct 27, 2020

@rpjohnst

* Asian text tends to use fewer characters to express the same information content, so some of the 1-vs-3 cancels out there.

All Asians are not Chinese. Your statement is wrong for most Asian scripts.

It's not really wrong. Sure, Chinese is more compact than, say, Japanese; however, according to https://hsivonen.fi/string-length/ the Universal Declaration of Human Rights in Japanese, encoded in UTF-8, is only 19% larger than the English version (10227 bytes vs 8565 bytes). If I multiply 8565 by 3 I get 25695 bytes, so clearly some of the 1-vs-3 cancels out, as 25695 bytes is much more than 10227 bytes. It uses fewer bytes than the Dutch version, for example, at the very least.

Additionally, it's worth noting that Twitter has a 140-character limit for Japanese text but 280 for English text. This is because Japanese characters convey much more information per character than English ones; see https://blog.twitter.com/en_us/topics/product/2017/Giving-you-more-characters-to-express-yourself.html.

* Much text in practice includes a lot of ASCII-range markup. Notably for this thread, this includes both the JSON in the LSP itself, and many of the programming languages the LSP is used for.

Yes, but only because punctuation and numbers are considered "Basic Latin". Asian languages should not be put at a disadvantage.

Programs typically end up containing a lot of Latin characters. Keywords, operators, and standard-library function names are all written in Latin script. Even if you write all your variable names, function names, and comments in, say, Korean, there will still be a lot of symbols in Latin script.

Note that Language Server Protocol is designed for programming languages, not regular documents.

@jayadevanraja commented Oct 28, 2020

@xfix
For most Asian scripts, characters are 3 times longer in UTF-8. Chinese, or CJK (Han unification), is used for Mandarin, Cantonese, Japanese, etc., but not by most scripts in the world. Most scripts belong to the Brahmic family, the Perso-Arabic family, etc.

@jayadevanraja commented Oct 28, 2020

My points are these (none of which have been refuted). By UTF-16, I mean compact UTF-16 (as mentioned in JEP 254).

  1. For individual strings, UTF-16 is much more compact in general. If we take UTF-16:
    (1.1) For an ASCII source file, UTF-16 and UTF-8 are the same size.
    (1.2) For Latin-1 (western and central Europe), UTF-16 is half the size of UTF-8.
    (1.3) For most Asian scripts, characters are 3 times longer in UTF-8. Chinese, or CJK (Han unification), is used for Mandarin, Cantonese, Japanese, etc., but not by most scripts in the world. Most scripts belong to the Brahmic family, the Perso-Arabic family, etc.
    (1.4) Unicode (for ASCII backward-compatibility reasons) groups English characters with numbers and punctuation. Text in every language uses those. That is not the case for identifiers: identifiers usually disallow punctuation and spaces.
  2. UTF-16 (including unpaired surrogates) can represent more code space than UTF-8. Those characters, although not part of Unicode, arise in practice.
  3. There are miscellaneous benefits to UTF-16, like being far more compact for encoding binary.
  4. The OP wanted to know if we can replace UTF-16 with codepoints, not with UTF-8.

TLDR

Compact UTF-16 is more compact and has a larger code space. What is the reason some people want to use UTF-8 instead of compact UTF-16 (as mentioned in JEP 254)?

@xfix (Contributor) commented Oct 28, 2020

Compact UTF-16 is a system-internal encoding designed never to leave the system it is used in, which makes it a bad choice for a standard designed to work across programs, like the Language Server Protocol. In a way it is comparable to other system-internal encodings like CESU-8, MUTF-8, WTF-8, PEP 383, and Perl 5's utf8 (not to be confused with UTF-8).

Also:

UTF-16 (including unpaired surrogates) can represent more code space than UTF-8. Those characters, although not a part of Unicode, arise in practice.

If you directly put unpaired surrogates into your source files, that's your problem. That is not valid UTF-16, and most text editors won't handle it. If you need an unpaired surrogate in a string, write "\uD800" instead.

@bfredl commented Oct 28, 2020

(1.1) For ASCII source file, both UTF-16 and UTF-8 are the same.

Is this literally a shitpost thread now? :D

@bfredl commented Oct 28, 2020

UTF-16 (including unpaired surrogates) can represent more code space than UTF-8. Those characters, although not a part of Unicode, arise in practice.

If we compare improper UTF-16, we should compare it with improper UTF-8. Improper UTF-8 can represent every integer between 0 and 2^31 - 1, including an overlong zero for when you need a Unicode 0x0000 where an ASCII NUL is verboten.

@soc commented Oct 28, 2020

@jayadevanraja Please stop it.

@Kalyani-Varma commented Dec 3, 2020

Is this literally a shitpost thread now?

I think that this sort of profanity should have been avoided. You did that after quoting a sentence out of context.

I have gone through this thread, and I find that most of his points are not answered. Referring to all Asian languages as Chinese and Japanese is also not in the right spirit.

@bfredl commented Dec 3, 2020

I think that this sort of profanity should have been avoided. You did that after quoting a sentence out of context.

@Kalyani-Varma As far as I know, this term is the standard terminology for the phenomenon in all circles (including FOSS circles) where it occurs. But if you have a concretely better alternative, I am always happy to learn.

I don't think the context in which the sentence occurs (which is not hard to find w.r.t. my comment) explains why it is relevant to the topic at hand, which is offset measurement, not compactness of encoding.

@jayadevanraja commented Dec 4, 2020

@Kalyani-Varma
Thanks for the kind words. I left the discussion when the tone changed from that of a scholarly, friendly discussion to a belligerent one.

@sam-mccall (Contributor) commented Jan 26, 2021

The offsetEncoding extension for UTF-8/UTF-32 support is implemented on the client side in Neovim's LSP client and kak-lsp. On the server side, clangd supports it, and the rust-analyzer folks are working on it.

(I mentioned this extension above, but it may have gotten lost in the contentious discussion :-)

@rvolgers commented Feb 1, 2021

Since Microsoft was historically one of the major ecosystems using UTF-16, it is perhaps worth mentioning that Microsoft now recommends UTF-8 for UWP applications and is making it possible to use native Windows APIs with UTF-8:

https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page

I hesitate to post this because offsetEncoding allows client and server to use whatever encoding is convenient for both, so the discussion about which encoding is best is not so important anymore. If anybody thinks my comment is not correct or a cause for more discussion, please let me know and I will delete it rather than cause more noise here.

@ViliamVadocz commented Feb 3, 2021

I hesitate to post this because offsetEncoding allows client and server to use whatever encoding is convenient for both, so the discussion about which encoding is best is not so important anymore. If anybody thinks my comment is not correct or a cause for more discussion, please let me know and I will delete it rather than cause more noise here.

Where can I find the documentation for this? I am looking at the LSP 3.17 specification and it only mentions UTF-16.

@dbaeumer (Member) commented Feb 3, 2021

It is currently only part of the LSIF specification (see positionEncoding here: https://microsoft.github.io/language-server-protocol/specifications/lsif/0.5.0/specification/#metaData). No work has been done yet to add it to the 3.17 spec.

@matklad (Contributor) commented Feb 17, 2021

rust-analyzer now implements the offsetEncoding extension of clangd for UTF-8.

@ViliamVadocz commented Feb 20, 2021

Does VS Code not support UTF-8 at all?
I have spent pretty much an entire day just trying to figure out how to make my language client give offsets in UTF-8. I am using the npm package vscode-languageclient and I couldn't even get it to send the offsetEncoding field as part of ClientCapabilities. Should I just switch IDEs to test UTF-8?

@soc commented Feb 21, 2021

@ViliamVadocz I think so; it's now Microsoft's job to support this and (preferably) also deprecate/migrate away from their own proprietary encoding.

@DavidGoldman commented Mar 17, 2021

Are there plans to generalize the offsetEncoding extension and implement support for it in VS Code? sourcekit-lsp also runs into this issue, since Apple's index-store format uses UTF-8 offsets.
