Tokenizing Text Held in a Rope Data Structure

Question

I am building a text editor which makes use of a Ragel based tokenizer to support syntax highlighting. I am considering the use of a rope data structure to support efficient modifications and undo/redo operations. Is there a standard approach for tokenizing or searching text contained in this type of data structure? Some characters can cause the tokenizer to consume the rest of the stream.

While I have heard of ropes before, I have never actually needed to use them directly: this is an interesting question. Are there any constraints on what characters are in each node in the rope? Perhaps each token should be its own node? — Snowman, Aug 23 '14 at 1:16
I may be tokenizing at too high a level and have to revisit that area. Rather than having a single token consume an entire comment or string; it is probably wiser to have a begin and end token for those constructs. The renderer could then simply render based on state; now, it renders characters with a style based on the token type it belongs to. I may be able to have a rope that keeps track of whole tokens rather than segments of text. But, I am still curious about operations like searching efficiently. Would I have to reconstruct the string for each search? — sesteel, Aug 23 '14 at 5:20

david.pfx · Accepted Answer · 2014-08-23 10:24:44Z

I'm familiar with the underlying state-machine approach described in your link --- it's been around for decades. It can tokenise/categorise any stream of text that supports a get-next-character operation.

I'm familiar with ropes in the context of a text editor. The usual purpose (as I know it) is to break the strings into portions that have the same display attributes: colour, font, link, line breaks etc. This works well for both editing and display. The major operations are: insert character; delete character; delete token; cut and paste. Maintaining the rope is not easy.

It's not obvious from your question whether you expect the tokeniser to generate the rope. Is it a rope of tokens, where each token has its own display attributes? I would be troubled that editing and tokenisation could interfere with each other. Things like quotes and comments can reach a long way.

No, I don't there is a standard way of combining these ideas. I suspect the right idea is for the tokeniser to generate/regenerate the rope, immediately for the on-screen portion and in the background for the rest. Edits should affect the rope (only) with a short delay before retokenising. The tokenising needs to be interruptible too.

It feels like a reasonable approach, but I'm sure there are many challenges in making it work well. You might want to read some source code (Eclipse? Netbeans?) to see how others do it.

asked	4 months ago
viewed	118 times
active	4 months ago

current community

your communities

more stack exchange communities

Tokenizing Text Held in a Rope Data Structure

migrated from stackoverflow.com Aug 22 '14 at 20:03

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged data-structures text-processing or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Tokenizing Text Held in a Rope Data Structure

migrated from stackoverflow.com Aug 22 '14 at 20:03

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged data-structures text-processing or ask your own question.

Related

Hot Network Questions