Take the 2-minute tour ×
Programmers Stack Exchange is a question and answer site for professional programmers interested in conceptual questions about software development. It's 100% free, no registration required.

I need to do some project specific automatic checking of source codes written in C++

Limitations:

  • Algorithm and its implementation should be simple, easily maintainable, extendable and understandable without need of Computer Science degree.
  • Speed and memory consumption of algorithm does not matter. I can run it at night and see result next morning. So it can be 100-1000x slower than fast but hard to maintain algorithms.
  • I want to write it myself and have fine control about it. No parser generator tools like flex, bison etc.
  • I dont want to use existing C++ parsers like CLang etc.
  • Please in answer dont use formal language theory. Try to explain it with simple language, understandable to ordinary self-taught programmer.

So what is the best and easy to implement universal parsing algorithm if we assume speed and memory is not an issue?

EDIT: I experimenting with this approaches and their combinations:

  • a) Initially some time ago i tried some form of recursive descent parsing.
    partical succes done with it, but its too rigid and hard to maintain, hard to hack and extend.

  • b) Repeated applying set of transformations on array of objects until there is no transformation to apply.

    At beginning each character from parsed source code is one object in array Then i have set of transformation functions:

    bool Build_...( vector<unique_ptr<CObject>>& objects );
    

    And each of those function removes one or more lower level object from array and replaces them with higher level object for example Build_Whitespacefinds sequence of space and tab characters and replaces them with one CWhitespace object. Build_Comments find sequence of characters which makes comment and replaces them with CComment object etc. (I dont want to remove whitespace and comments, whole processing needs to be done with them ) After each cycle is removed some lower level object and replaced with higher level object. And at end of this process there is few higher level objects like CFunctionDefinition, CClassDefiniton etc.

    So basically it works this way:

    bool objectsChanged = true;
    while ( objectsChanged ) {
        objectsChanged = false;
        objectsChanged |= Build_Whitespace( objects );
        objectsChanged |= Build_EndLines( objects );
        objectsChanged |= Build_Names( objects );
        .
        .   
        objectsChanged |= Build_ForCycle( objects );      
        . 
        .
    }
    

    This algorithm is of course very slow, but works, is simple and i think it can parse language of any complexity. It has also advantage of partial parsing, can parse incomplete code because it leaves language objects as is if they have unknown syntax.

  • c) Currently i am experimenting with parsing using reverse generation of source codes.

    Instead of trying to parse source code i am generating syntax tree for all possible source codes and compare each resulting code with one that needs to be parsed until i find one that match. This sounds crazy because there is infinite many possible source codes. But its not so bad as it sounds because comparing can be done early and whole tested subtrees avoided at first generated character which does not match.

Actually i think b) is better for my purposes and easier to use and hack than c) which is more rigid and for succesful parsing of some subset of code needs complete gramatical rules for every possible code.

What other posibilites you know about? What do you think about selected approaches? Some other ideas?

share|improve this question

closed as too broad by gnat, MichaelT, m3th0dman, Basile Starynkevitch, James McLeod Jan 8 at 11:46

There are either too many possible answers, or good answers would be too long for this format. Please add details to narrow the answer set or to isolate an issue that can be answered in a few paragraphs. If this question can be reworded to fit the rules in the help center, please edit the question.

2  
Sharing your research helps everyone. Tell us what you've tried and why it didn’t meet your needs. This demonstrates that you’ve taken the time to try to help yourself, it saves us from reiterating obvious answers, and most of all it helps you get a more specific and relevant answer. Also see How to Ask –  gnat Jan 8 at 8:16
    
@gnat. Ok, wait a while... –  user3123061 Jan 8 at 8:20
4  
What problem are you trying to solve with this parser? C++ is one of the hardest languages to parse, but sometimes you can go a very long way with a simple text search or regular expression match. –  Bart van Ingen Schenau Jan 8 at 8:26
4  
For C++ there is no simple[st] parsing algorithm. C++ has a context-dependent grammar (what you parse can mean different things, based on what you parsed before) and because of this, most problems that can be solved through parsing C++, have to parse the code fully. You have ~ three solutions here: use an existing parser (like CLang), solve your problem with a c++-agnostic approach (like regular expressions?), or write your own parser (this is non-trivial and you can expect it to take a few years of effort, until you have written a partial and very buggy parser). –  utnapistim Jan 8 at 9:17
    
@Bart van Ingen Schenau: It can sounds naive, but i want generic solution which can be easily adapted for every parsing problem i can face with. :-) –  user3123061 Jan 8 at 10:41

2 Answers 2

up vote 6 down vote accepted

Parsing full C++ is very hard. And in a large enough code, nearly all of C++ syntactical features are used (since the hard to parse features occur in most C++ code).

Also, what is actually parsed is preprocessed output. So you should at least parse the output of the preprocessor.

For example, it is quite hard -since very contextual- to understand if an occurrence of sort refers to one of the std::sort (which one?) or to something else.

BTW, even if you believe to use a small subset of C++ syntax, it is still hard to parse (since contextual), and you need to somehow understand the standard headers (which are internally using a lot of C++ features).

So I believe you are going on the wrong route. The realistic way is to reuse an existing C++ parser. You could buy one (Edison C++ parser is rumored to be very expansive), or you could reuse one from an existing C++ free software compiler, such as Clang/LLVM (which you don't want to use: I don't know it well but I guess it would be reasonable to use Clang) or GCC.

GCC can be customized, notably at the Gimple level, by coding your own extension in MELT; you won't see C++ parse trees, but Gimple instructions. MELT can also handle GCC Generic parse trees.

You should explain what practical project specific automatic checking you want to do. FWIW, MELT was exactly designed for such use cases (project specific checking).

Actually, if your code base is small, it might be perhaps easier to convert it (manually) to something else (perhaps your own C or C++ like DSL) and implement a translator from your language to C++; but to do that you need a computer science background

BTW, parsing is not the hardest problem (it is very complex, but compilers solved it). Analyzing your code might be harder (intractable), or undecidable (read about the halting problem). Both parsing and static program analysis require some expertise (this is why Computer Science degrees are useful for) - so get or buy some.

share|improve this answer
1  
Why would you not recommend the LLVM C++ parser? It's basically the only good free software out there for that purpose. –  Thomas Eding Jan 8 at 22:29
    
I am not recommending against Clang (which is the LLVM parser). But the OP explicitly asked "I don't want to use existing C++ parsers like CLang"; I'm trying to explain it is a mistake; I just happen to know better GCC since I am the main author of MELT –  Basile Starynkevitch Jan 9 at 5:48
    
@ThomasEding: why do you believe MELT (for and with GCC) is not a good free software for that purpose? –  Basile Starynkevitch Jan 9 at 9:06

No human or even machine can ever write a parser for a language as incomprehensible as C++. There is no finite set of axiomatic rules in the universe to describe it without contradiction (i.e. bugs). Here is a quote from Steve Yegge's Tour de Babel rant:

As for C: it's so easy to write a C compiler that you can build tools on top of C that act like introspection. C++, on the other hand, is essentially un-parseable, so if you want to write smart tools that can, for example, tell you the signatures of your virtual functions, or refactor your code for you, you're stuck using someone else's toolset, since you sure as heck aren't gonna parse it. And all the toolsets for parsing C++ out there just plain suck.

Also, here is a George Polya quote from his famous (mathematical) problem solving book:

If there is a problem you can't solve, then there is an easier problem you can solve: find it.

Choose a language for which you can write a parser in less than 100 lines of code! Lisp for example... or Brainfuck!

share|improve this answer

Not the answer you're looking for? Browse other questions tagged or ask your own question.