I need to do some project specific automatic checking of source codes written in C++
Limitations:
- Algorithm and its implementation should be simple, easily maintainable, extendable and understandable without need of Computer Science degree.
- Speed and memory consumption of algorithm does not matter. I can run it at night and see result next morning. So it can be 100-1000x slower than fast but hard to maintain algorithms.
- I want to write it myself and have fine control about it. No parser generator tools like flex, bison etc.
- I dont want to use existing C++ parsers like CLang etc.
- Please in answer dont use formal language theory. Try to explain it with simple language, understandable to ordinary self-taught programmer.
So what is the best and easy to implement universal parsing algorithm if we assume speed and memory is not an issue?
EDIT: I experimenting with this approaches and their combinations:
a) Initially some time ago i tried some form of recursive descent parsing.
partical succes done with it, but its too rigid and hard to maintain, hard to hack and extend.b) Repeated applying set of transformations on array of objects until there is no transformation to apply.
At beginning each character from parsed source code is one object in array Then i have set of transformation functions:
bool Build_...( vector<unique_ptr<CObject>>& objects );
And each of those function removes one or more lower level object from array and replaces them with higher level object for example
Build_Whitespace
finds sequence of space and tab characters and replaces them with oneCWhitespace
object.Build_Comments
find sequence of characters which makes comment and replaces them withCComment
object etc. (I dont want to remove whitespace and comments, whole processing needs to be done with them ) After each cycle is removed some lower level object and replaced with higher level object. And at end of this process there is few higher level objects likeCFunctionDefinition
,CClassDefiniton
etc.So basically it works this way:
bool objectsChanged = true; while ( objectsChanged ) { objectsChanged = false; objectsChanged |= Build_Whitespace( objects ); objectsChanged |= Build_EndLines( objects ); objectsChanged |= Build_Names( objects ); . . objectsChanged |= Build_ForCycle( objects ); . . }
This algorithm is of course very slow, but works, is simple and i think it can parse language of any complexity. It has also advantage of partial parsing, can parse incomplete code because it leaves language objects as is if they have unknown syntax.
c) Currently i am experimenting with parsing using reverse generation of source codes.
Instead of trying to parse source code i am generating syntax tree for all possible source codes and compare each resulting code with one that needs to be parsed until i find one that match. This sounds crazy because there is infinite many possible source codes. But its not so bad as it sounds because comparing can be done early and whole tested subtrees avoided at first generated character which does not match.
Actually i think b) is better for my purposes and easier to use and hack than c) which is more rigid and for succesful parsing of some subset of code needs complete gramatical rules for every possible code.
What other posibilites you know about? What do you think about selected approaches? Some other ideas?