I am building a text editor that uses a Ragel-based tokenizer to support syntax highlighting. I am considering a rope data structure to support efficient modifications and undo/redo operations. Is there a standard approach for tokenizing or searching text held in this type of data structure? Some characters can cause the tokenizer to consume the rest of the stream.
migrated from stackoverflow.com Aug 22 '14 at 20:03. This question came from our site for professional and enthusiast programmers.
I'm familiar with the underlying state-machine approach described in your link; it's been around for decades. It can tokenise/categorise any stream of text that supports a get-next-character operation.

I'm familiar with ropes in the context of a text editor. The usual purpose (as I know it) is to break the strings into portions that have the same display attributes: colour, font, link, line breaks etc. This works well for both editing and display. The major operations are: insert character; delete character; delete token; cut and paste. Maintaining the rope is not easy.

It's not obvious from your question whether you expect the tokeniser to generate the rope. Is it a rope of tokens, where each token has its own display attributes? I would be troubled that editing and tokenisation could interfere with each other. Things like quotes and comments can reach a long way: a single opening quote or comment marker can change the category of everything after it.

No, I don't think there is a standard way of combining these ideas. I suspect the right idea is for the tokeniser to generate/regenerate the rope, immediately for the on-screen portion and in the background for the rest. Edits should affect the rope (only), with a short delay before retokenising. The tokenising needs to be interruptible too.

It feels like a reasonable approach, but I'm sure there are many challenges in making it work well. You might want to read some source code (Eclipse? Netbeans?) to see how others do it.
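To make the get-next-character point concrete, here is a minimal sketch of feeding a rope to a state machine one leaf at a time, without flattening it into a single string. Everything in it is an assumption for illustration: `Leaf`, `Concat`, `chunks`, `Tokenizer` and `tokenize_rope` are hypothetical names, the rope stores plain text rather than display attributes, and the tiny hand-written machine only stands in for a Ragel-generated one. The part that matters is that the tokeniser keeps its state between calls to `feed`, so a token may span several leaves.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple, Union


@dataclass
class Leaf:
    text: str  # one contiguous piece of the document


@dataclass
class Concat:
    left: "Node"
    right: "Node"


Node = Union[Leaf, Concat]


def chunks(node: Node) -> Iterator[str]:
    """Yield leaf strings left to right without flattening the rope."""
    stack = [node]
    while stack:
        n = stack.pop()
        if isinstance(n, Leaf):
            if n.text:
                yield n.text
        else:
            stack.append(n.right)  # pushed first so that left is popped first
            stack.append(n.left)


class Tokenizer:
    """Toy state machine (hypothetical stand-in for a generated scanner)."""

    def __init__(self) -> None:
        self.state = "start"       # "start", "word" or "string"
        self.buf: List[str] = []   # characters of the token in progress
        self.tokens: List[Tuple[str, str]] = []

    def _emit(self, kind: str) -> None:
        self.tokens.append((kind, "".join(self.buf)))
        self.buf = []
        self.state = "start"

    def feed(self, chunk: str) -> None:
        """Consume one chunk; state and partial token survive between calls."""
        for ch in chunk:
            if self.state == "string":
                self.buf.append(ch)
                if ch == '"':
                    self._emit("string")
                continue
            if self.state == "word":
                if ch.isalnum() or ch == "_":
                    self.buf.append(ch)
                    continue
                self._emit("word")  # word ended; fall through to handle ch
            if ch == '"':
                self.buf.append(ch)
                self.state = "string"
            elif ch.isalnum() or ch == "_":
                self.buf.append(ch)
                self.state = "word"
            elif not ch.isspace():
                self.tokens.append(("punct", ch))

    def finish(self) -> List[Tuple[str, str]]:
        """Flush whatever token is still open at end of input."""
        if self.state == "word":
            self._emit("word")
        elif self.state == "string":
            self._emit("unterminated_string")
        return self.tokens


def tokenize_rope(root: Node) -> List[Tuple[str, str]]:
    tok = Tokenizer()
    for piece in chunks(root):
        tok.feed(piece)
    return tok.finish()
```

For example, a rope whose leaves are `'pri'`, `'nt("hel'`, `'lo wor'` and `'ld") x'` tokenises the same way as the flat string `print("hello world") x`, even though both the word and the string literal straddle leaf boundaries. An unclosed quote swallows the rest of the input, which is the "consume the rest of the stream" case from the question. Because the machine's state is just a small value plus a partial buffer, it could in principle be saved at intervals so that retokenising restarts near an edit rather than from the top, but that is a design suggestion, not something this sketch implements.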