Take the 2-minute tour ×
Programmers Stack Exchange is a question and answer site for professional programmers interested in conceptual questions about software development. It's 100% free, no registration required.

I am building a text editor which makes use of a Ragel based tokenizer to support syntax highlighting. I am considering the use of a rope data structure to support efficient modifications and undo/redo operations. Is there a standard approach for tokenizing or searching text contained in this type of data structure? Some characters can cause the tokenizer to consume the rest of the stream.

share|improve this question

migrated from stackoverflow.com Aug 22 '14 at 20:03

This question came from our site for professional and enthusiast programmers.

    
While I have heard of ropes before, I have never actually needed to use them directly: this is an interesting question. Are there any constraints on what characters are in each node in the rope? Perhaps each token should be its own node? –  Snowman Aug 23 '14 at 1:16
    
I may be tokenizing at too high a level and have to revisit that area. Rather than having a single token consume an entire comment or string; it is probably wiser to have a begin and end token for those constructs. The renderer could then simply render based on state; now, it renders characters with a style based on the token type it belongs to. I may be able to have a rope that keeps track of whole tokens rather than segments of text. But, I am still curious about operations like searching efficiently. Would I have to reconstruct the string for each search? –  sesteel Aug 23 '14 at 5:20

1 Answer 1

up vote 1 down vote accepted

I'm familiar with the underlying state-machine approach described in your link --- it's been around for decades. It can tokenise/categorise any stream of text that supports a get-next-character operation.

I'm familiar with ropes in the context of a text editor. The usual purpose (as I know it) is to break the strings into portions that have the same display attributes: colour, font, link, line breaks etc. This works well for both editing and display. The major operations are: insert character; delete character; delete token; cut and paste. Maintaining the rope is not easy.

It's not obvious from your question whether you expect the tokeniser to generate the rope. Is it a rope of tokens, where each token has its own display attributes? I would be troubled that editing and tokenisation could interfere with each other. Things like quotes and comments can reach a long way.

No, I don't there is a standard way of combining these ideas. I suspect the right idea is for the tokeniser to generate/regenerate the rope, immediately for the on-screen portion and in the background for the rest. Edits should affect the rope (only) with a short delay before retokenising. The tokenising needs to be interruptible too.

It feels like a reasonable approach, but I'm sure there are many challenges in making it work well. You might want to read some source code (Eclipse? Netbeans?) to see how others do it.

share|improve this answer

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.