Lexer+Parser code for my “Reedoo” programming language

Question

I've been working on my own programming language, as a hobby for the past couple of months now, called Reedoo. It is implemented in pure C++ 11 and uses only the libraries that come with C++; no external ones. I've never studied Computer Science, but I've done about 2 weeks of software development so far in school, but that's it.

Since I've never studied Computer Science, I've never taken a compiler class, so I'd like some feedback to know whether I'm using techniques that will result in a "useable" language.

The language is currently really, really small. It has a "print" statement, variables, expression evaluation, comments and if statements.

Here's a link to the code on GitHub.

Here's a simple program demonstrating all of the features so far:

if "Hello World" == "Hello World" and "Reedoo" == "reedoo" {
    print "This part of the block shouldn't run"
} else {
    print "This should be displayed in the console."
    print 10 + 2 * (5 + 4)    # Example of expression evaluation and a comment
}

How it works, currently:

Reedoo has a hand-written lexical scanner that scans and input .rdo file and identifies the tokens. Here's the tokens from that program above:

if
cond:string:"Hello World" eqeq string:"Hello World" and string:"Reedoo" eqeq string:"reedoo" 
opencb
print
string:"This part of the block shouldn't run"
sc
closecb
else
cond:string:"Hello World" eqeq string:"Hello World" and string:"Reedoo" eqeq string:"reedoo" 
opencb
print
string:"This should be displayed in the console."
sc
print
expr:(10+2*(5+4))
sc
closecb
sc

Each token is on its own line. The "sc" token stands for semi-colon, the lexer injects semi-colons in place of new lines. When the lexer finds a condition, like in an if statement, it adds the parts of the condition together until it finds the open curly bracket token.

The parser

The parser then takes the tokens and one-by-one adds them together until it matches one of the patterns in the parser. The parser is also hand-written.

The parser makes calls to other functions I wrote also, for example, when evaluating an expression, the parser calls a function I wrote that returns the result of the expression.

When evaluating conditions the parser executes another function that returns either true or false.

Variables

Variables are stored in a C++ map.

I also make extensive use of vectors in all aspects of the lexer / parser and related functions. Is this bad for performance?

Finally, the parser just executes code as it sees it. I don't know enough about to trees to implement one.

I'd just like some feedback on my project. I'd like to know if I am going about things the right way.

Here's the lexer code:

#include <iostream>
#include <string>
#include <cstring>
#include <vector>

#include "lexer.h"

/* Definitions */
/* These are our constants, these are defined as constant at the start of the program so
   that if anything goes wrong in the execution of the code we can always display the
   right kind of errors. */
#define IO_ERROR "[IO ERROR] "
#define SYNTAX_ERROR "[SYNTAX ERROR] "
#define ASSIGN_ERROR "[ASSIGN ERROR] "

using namespace std;

/* Global Variables */
/* Not all of these are actual "keywords" that can be used in programs.
   They are called keywords because they are reserved, either because they
   are specified as keywords in the grammar or because they are reserved by
   the interpreter for internal use. */
std::string reserved[14] = { "print", "string", "sc", "variable", "eq", "undefined", "nl", "num", "expr", "eof", "if", "else", "and", "or" };
/* We store lex_tokens in a vector, we could use an array but specifying an arrays
   size at runtime is technically impossible and the work arounds are a pain. */
std::vector<std::string> lnums;

//string s;

int lnum = 1;
int ecount = 0;

bool rdo_is_reserved(string tok) {
  int i;
  for (i = 0; i < 9;i++) {
    if (tok == reserved[i])
      return true;
    else
      return false;
  }
  return false;
}

vector<string> lex(string prog) {
  std::vector<std::string> lex_tokens;
  int i = 0;
  int start_ce = 0;
  string tok = "";
  string n = "";
  string expr = "";
  bool state = 0;
  bool expr_started = 0;
  bool is_expr = 0;
  bool var_started = 0;
  bool sl_comment_started = 0;
  bool unquoted_str_fnd = false;
  bool block_started = false;
  bool condstarted = false;
  string s = "";
  string v = "";
  string ce = "";
  string condition = "";

  for(i = 0; i < prog.size(); ++i) {
    tok += prog[i];
      if (tok == " " and state == 0) {
        tok = "";
        if (n != "") {
          //is_expr = 1;
          //lex_tokens.push_back(reserved[7] + ":" + n);
        }
        n = "";
        if (v != "") {
          lex_tokens.push_back(reserved[3] + ":\"" + v + "\"");
        }
        v = "";
        var_started = 0;
      } else if (tok == ";" and state == 0) {
        tok = "";
        if (expr.length() >= expr.length()-1) {
          if (expr.substr(expr.length()-1) == "+" or expr.substr(expr.length()-1) == "-" or expr.substr(expr.length()-1) == "/" or expr.substr(expr.length()-1) == "*") {
            if (lnum == 0)
              lnum++;
            cout << SYNTAX_ERROR << "Numbers and expressions must not end with an opperator [line " << lnum << "]" << endl;
            /* If the error count goes about 0, the program immediately exits. This prevents crashes and provides a better user experience. */
            ecount++;
          }
        }
        if (expr != "" and is_expr == 1) {
          lex_tokens.push_back(reserved[8] + ":(" + expr + ")");
        } else if (n != "" and is_expr == 0) {
          lex_tokens.push_back(reserved[7] + ":" + expr);
        }
        if (v != "") {
          lex_tokens.push_back(reserved[3] + ":\"" + v + "\"");
        }
        if (lex_tokens.back() != "sc") {
          lex_tokens.push_back(reserved[2]); 
        }        
        v = "";
        var_started = 0;
        n = "";
        expr = "";
        is_expr = 0;
      } 
      /* Single-line comments */
      else if (tok == "#" and state == 0) {
        /* Start of a single-line comment means end of an expression */
        if (expr != "" and is_expr == 1) {
            lex_tokens.push_back(reserved[8] + ":(" + expr + ")");
          } else if (n != "" and is_expr == 0) {
            lex_tokens.push_back(reserved[7] + ":" + expr);
          }
          /* Also means end of variables */
          if (v != "") {
            lex_tokens.push_back(reserved[3] + ":\"" + v + "\"");
          }
          /* If sl_comment_started doesn't already equal 1, set it to 1 */
        if (sl_comment_started == 0) {
          sl_comment_started = 1;
        }
        v = "";
        var_started = 0;
        n = "";
        expr = "";
        is_expr = 0;
      } else if (sl_comment_started == 1) {
        if (tok == "\n") {
          sl_comment_started = 0;
          if (lex_tokens.size() != 0) {
          if (lex_tokens.back() != "sc") {
            lex_tokens.push_back(reserved[2]); 
          } 
        }
        }
        tok = "";
      } else if (tok == "\r") {
        tok = "";
      } else if (tok == "\t") {
        tok = "";
      } else if (tok == "\n" and state == 1) {

        cout << SYNTAX_ERROR << "EOL found inside of string. [line " << lnum << "]" << endl;
        ecount++;

      } else if (tok == "\n" and state == 0) {

        if (state == 0) {
          tok = "";
          if (expr.length() >= expr.length()-1) {
            if (expr.substr(expr.length()-1) == "+" or expr.substr(expr.length()-1) == "-" or expr.substr(expr.length()-1) == "/" or expr.substr(expr.length()-1) == "*") {
              if (lnum == 0)
                lnum++;
              cout << SYNTAX_ERROR << "Numbers and expressions must not end with an opperator [line " << lnum << "]" << endl;
              /* If the error count goes about 0, the program immediately exits. This prevents crashes and provides a better user experience. */
              ecount++;
            }
          }
          lnum++;
          if (expr != "" and is_expr == 1) {
            lex_tokens.push_back(reserved[8] + ":(" + expr + ")");
          } else if (n != "" and is_expr == 0) {
            lex_tokens.push_back(reserved[7] + ":" + expr);
          }
          if (v != "") {
            lex_tokens.push_back(reserved[3] + ":\"" + v + "\"");
          }
          if (lex_tokens.back() != "sc" and lex_tokens.back() != "opencb") {
            lex_tokens.push_back(reserved[2]); 
          }       
          v = "";
          var_started = 0;
          n = "";
          expr = "";
          is_expr = 0;
        }
      } else if (tok == "%") {
        if (var_started == 0)
          var_started = 1;
      } else if (var_started == 1) {
        v += tok;
        tok = "";
      } else if (tok == "0" or tok == "1" or tok == "2" or tok == "3" or tok == "4" or tok == "5" 
        or tok == "6" or tok == "7" or tok == "8" or tok == "9") {
        if (state == 0) {
          n += tok;
          expr += tok;
        } else {
          s += tok;
        }
        tok = "";
      } else if (tok == "+" or tok == "-" or tok == "*" or tok == "/" or tok == "(" or tok == ")") {
        if (state == 0) {
          expr += tok;
          is_expr = 1;
          tok = "";
          n = "";
        }
      } else if (tok == "=" and state == 0) {
        if (lex_tokens.back() == "eq") {
          if (condstarted == false) {
            lex_tokens.back() = "eqeq";
          } else {
            condition += "eqeq ";
            lex_tokens.pop_back();
          } 
        } else {
        lex_tokens.push_back("eq");
        }
        tok = "";
      } else if (tok == reserved[12] and state == 0) {
          if (condstarted == false) {
            lex_tokens.push_back("and");
          } else {
            condition += "and ";
          } 
        tok = "";
      } else if (tok == reserved[13] and state == 0) {
          if (condstarted == false) {
            lex_tokens.push_back("or");
          } else {
            condition += "or ";
          }
        tok = "";
      } else if (tok == reserved[10]) {
        lex_tokens.push_back(reserved[10]);
        condstarted = true;
        condition = "cond:";
        tok = "";
      } else if (tok == reserved[11]) {
        lex_tokens.push_back(reserved[11]);
        tok = "";
      } else if (tok == "{") {
        block_started = true;
        condstarted = false;
        lex_tokens.push_back(condition);
        lex_tokens.push_back("opencb");
        tok = "";
      } else if (tok == "}") {
        if (expr != "" and is_expr == 1) {
            lex_tokens.push_back(reserved[8] + ":(" + expr + ")");
          } else if (n != "" and is_expr == 0) {
            lex_tokens.push_back(reserved[7] + ":" + expr);
          }
          if (v != "") {
            lex_tokens.push_back(reserved[3] + ":\"" + v + "\"");
          }
          v = "";
        var_started = 0;
        n = "";
        expr = "";
        is_expr = 0;
        if (lex_tokens.back() != "opencb" and lex_tokens.back() != "sc") {
          lex_tokens.push_back("sc");
        }
        lex_tokens.push_back("closecb");
        block_started = false;
        tok = "";
      } else if (tok == reserved[0]) {
        lex_tokens.push_back(reserved[0]);
        tok = "";
      } else if (tok == "\"") {

        if (state == 0) {
          state = 1;
        } else if (state == 1) {
          state = 0;
          if (condstarted == false) {
            lex_tokens.push_back(reserved[1] + ":" + s + "\"");
          } else {
            condition += reserved[1] + ":" + s + "\" ";
          }
          s = "";
          tok = "";
        }
      } else if (state == 1) {
        s += tok;
        tok = "";
      } 

      if (ecount > 0) {
        exit(1);
      }

  }
  //cout << lex_tokens.size() << endl;
  for (i = 0; i < lex_tokens.size();i++) {
    //cout << lex_tokens[i] << endl;
  }
  return lex_tokens;
}

It's quite long, but I'll explain how it works. There is a variable called tok. tok gets 1 character longer each time the loop runs until it is equal to one of the keywords from the reserved array.

Strings are similar. The variable s is used to hold a string as it is being identified, when the entire string is identified the token is pushed onto the tokens vector.

Again, numbers are identified the same way as strings. The only difference being that that variable n is used.

Conditions are identified similarly to this as well. When an if keyword is found, the lexer assumes that everything following it is part of the condition, until it see's a closecb token.

The parser

#include <iostream>
#include <string>
#include <fstream>
#include <sstream>
#include <vector>
#include <cstdlib>
#include <cstdio>
#include <algorithm>
#include <functional>


#include "parser.h"
#include "variables.h"
#include "reedoo.h"
#include "io.h"
#include "cond.h"

using namespace std;

void parse(vector<string> tokens) {
  int errcount = 0;
  int linenum = 1;
  int i = 0;
  bool cond = false;
  int cond_result = 2;

  while (i < tokens.size()) {

    TOP:if (tokens[i] + " " + tokens[i+1] == "print sc") {
      cout << SYNTAX_ERROR << "'print' supplied without anything to print [line " << linenum << "]" << endl;
      errcount++;
      i+=2;
      break;
    }

    if (tokens[i] + " " + tokens[i+1].substr(0,6) + " " + tokens[i+2] == "print string sc" or
        tokens[i] + " " + tokens[i+1].substr(0,3) + " " + tokens[i+2] == "print num sc" or
        tokens[i] + " " + tokens[i+1].substr(0,4) + " " + tokens[i+2] == "print expr sc" or
        tokens[i] + " " + tokens[i+1].substr(0,8) + " " + tokens[i+2] == "print variable sc") {
      if (tokens[i+1].substr(0,8) == "variable") {
        doPRINT(goGETVAR(tokens[i+1]));
      } else {
        doPRINT(tokens[i+1]);
      }
      i+=3;
    } else if (tokens[i].substr(0,8) + " " + tokens[i+1] + " " + tokens[i+2].substr(0,3) + " " + tokens[i+3] == "variable eq num sc" or
        tokens[i].substr(0,8) + " " + tokens[i+1] + " " + tokens[i+2].substr(0,6) + " " + tokens[i+3] == "variable eq string sc" or
        tokens[i].substr(0,8) + " " + tokens[i+1] + " " + tokens[i+2].substr(0,4) + " " + tokens[i+3] == "variable eq expr sc" or
        tokens[i].substr(0,8) + " " + tokens[i+1] + " " + tokens[i+2].substr(0,8) + " " + tokens[i+3] == "variable eq variable sc") {
      doASSIGN(tokens[i],tokens[i+2]);
      i+=4;
    } else if (tokens[i] + " " + tokens[i+1].substr(0,4) + " " + tokens[i+2] == "if cond opencb") {
      //cout << eval_cond(tokens[i+1].substr(5)) << endl;
        cond_result = eval_cond(tokens[i+1].substr(5));
        if (eval_cond(tokens[i+1].substr(5))) {
            // Run true block
            //cout << "TOKENS: " << tokens[i+1].substr(5) << eval_cond(tokens[i+1].substr(5)) << endl;
            //i+=3;
        } else {

            //cout << "TOKENS: " << tokens[i+1].substr(5) << eval_cond(tokens[i+1].substr(5)) << endl;
            while (tokens[i] != "closecb") {
              i++;
            }
            i++;

        }
        i+=3;
    } else if (tokens[i] == "closecb") {
        if (tokens[i+1] == "else") {
          i+=1;
          if (cond_result == 0) {

          }
        } else {
          i+=2;
        }
    } else {
        break;
    }

    if (i >= tokens.size()-2) {
      break;
    }

    if (errcount > 0) {
      break;
    }
  }
}

The parser is actually quite simple, it just loops through the tokens, each time adding the next token on, until it matches one of the if statement conditions, if it does, the parser hands the hard work off to other functions in separate files.

The parser then pushes the iterator on, so for example, if the statement the parser identified was 3 tokens long, the parser it push the iterator on 3 tokens.

If you have something as complex as a language time to break out the appropriate tools. Lex/Yacc (or there descendants). — Loki Astari, yesterday

glampert · Accepted Answer · 2014-09-16 14:12:00Z

C style:

Well, the first thing I have to say is that your code is very C-ish. The first action here would be to move all those loose functions into two C++ classes, Lexer and Parser.

Another C vice of your code is to declare variables on top of a function. lex() is champion at that, with 18 variables piled at the top of the function. Absolutely don't do that in C++. Declare a variable in its point of use, restricting scope to where the var is needed.

Counters in for loops, especially, should be declared inside the loop:

int i;
for(i = 0; i < prog.size(); ++i) // poor practice
----
for(int i = 0; i < prog.size(); ++i) // much better

And yet another C habit is to use the /**/ (multi-line comments) everywhere. Why not use the much more practical // for single liners?

#define IO_ERROR "[IO ERROR] "
#define SYNTAX_ERROR "[SYNTAX ERROR] "
#define ASSIGN_ERROR "[ASSIGN ERROR] "

These macros are not very nice either. You should prefer a const std::string or const char[]. For variables/constants that are file scoped, it is a good practice to declare them inside an unnamed namespace, effectively making the constants file-scoped:

namespace 
{
    const std::string IO_ERROR     { "[IO ERROR] "     };
    const std::string SYNTAX_ERROR { "[SYNTAX ERROR] " };
    const std::string ASSIGN_ERROR { "[ASSIGN ERROR] " };
}

The other globals you have should be made members of a class, but if they were to remain as globals, then they should also be wrapped with an unnamed namespace.

using namespace std:

This was already mentioned and I second for that. Don't use namespace std because that defeats the purpose of a namespace, which is to allow identical names to coexist without clashes.

Possible bug?

bool rdo_is_reserved(string tok) {
  int i;
  for (i = 0; i < 9;i++) {
    if (tok == reserved[i])
      return true;
    else
      return false;
  }
  return false;
}

This function is iterating from 0 to 9, however, the reserved vector has 14 elements. Is this intentional or a bug? Did you mean reserved.size()? But this function is useless anyway, since the standard library provides std::find(), which you should be using instead.

Commented lines leftover:

if (n != "") {
  //is_expr = 1;
  //lex_tokens.push_back(reserved[7] + ":" + n);
}

Please clean up lines that are commented out. This is only liable to cause distraction and confuse the reader. Also, it conveys no important info since you can't know if the commented out code is correct or not.

Large functions:

The lex() function is a massive beast. I wouldn't want to be the poor programmer tasked with the job of modifying it or fixing a bug caused by it. This function needs an urgent refactoring. It should be broken down into several helper methods.

Calling exit():

lex() also calls exit(1) in its body. This is a bit harsh, don't you think? Terminating the program like this, without cleanup or error message. It should instead throw an exception. This is C++ and we have much more modern ways of handling errors.

goto and the lost label:

You have a label inside the parse() function, TOP:, which I would expect be paired with a goto statement. However, I couldn't find a single goto inside the function. So what is that label even doing there?

Since you never did CS school, you might not be aware of the status of goto. Lets just say that it is quite controversial. I for one really don't think goto still have a place in modern C++, like I said before, the language has exceptions and automatic resource management, so goto is pretty arcane.

couts:

Your code is also littered with cout calls, most are commented out. You shouldn't be using it directly inside functions like you did. Most users won't want the verbose output on every execution of the program. If you need them for debugging, a better approach is to wrap those calls into a macro that can be disables at compile time, e.g.:

#define DEBUG_LOG(msg) do { std::cout << msg << std::end; } while(0)

Or use a better, more mature, logger API. Do a search and you will find a ton of good options.

#define DEBUG_LOG(msg) do{ std::cout << msg << std::end; }while(0) would be better. — firda, 12 hours ago

firda · Answer 2 · 2014-09-15 19:49:57Z

This will be quite big project it seems. I will try to give you some hints (I have written something like ECMA-262 / JavaScript compiler and iterpreter with custom bytecode as school project).

If you really plan to develop big language, then you should probably learn about LEX, YACC and BISON, but you'd have to learn about formal grammar for parsers.

If you are fine with learning by trial and error, your Lexar is not bad, but I would advise to start writing some custom classes and enums instead of all those strings:

std::string reserved[14] = { "print", "string", "sc", "variable", "eq", "undefined",
    "nl", "num", "expr", "eof", "if", "else", "and", "or" };

You should use some constants to make your code readable and maintainable:

enum LaxarWords {
    lxPrint, lxString, lxSemiColon, ...

You can then immediatelly use these to produce some bytecode instead of the text representation. It is not a requirement, but a personal hint - I have actually started with small calculator, infix-to-postfix and infix-to-prefix conversion algorithms, expression evalution, and finally ended with ECMA-262. The byte code I have created used prefix notation, something like this:

while(greater(var("x"),const("0")), code(decrement(var("x"))))

This can be rewritten in some byte-code and recursively executed (using e.g. array of function-pointers). No need for the byte-code, good classes/structures with pointers should do.

When I read your code, I have noticed using namespace std; but std::string and std::vector. You should probably choose if you want to use std:: prefix or place some using std::string and such. You can run into unexpected troubles with using namespace std.

Parser

void parse(vector<string> tokens) {
  .
  .
  .
  while (i < tokens.size()) {

    TOP:if (tokens[i] + " " + tokens[i+1] == "print sc") {

This does not look good:

First, you have to copy the whole vector, using const vector<string>& should be better, if you don't need to change the tokens inside parser.
You are accessing the vector past the end
Place the label TOP: on separate line, please, this looks quite ugly.

This IF definitely needs some work, some parsing helper (like get_word) and separate if(current_token == "print") { ... (fetch the string to some variable first).

if (tokens[i] + " " + tokens[i+1].substr(0,6) + " " + tokens[i+2] == "print string sc" or
        tokens[i] + " " + tokens[i+1].substr(0,3) + " " + tokens[i+2] == "print num sc" or
        tokens[i] + " " + tokens[i+1].substr(0,4) + " " + tokens[i+2] == "print expr sc" or
        tokens[i] + " " + tokens[i+1].substr(0,8) + " " + tokens[i+2] == "print variable sc") {

Thanks for your feedback! I'll take it on board and fix up my interpreter. :D — Francis, 23 hours ago
Did you think about nested if cond { if cond2 { ...? You'll need recursion and therefore a change in parser() signature - it should return position where the parsing ended (or size_t& as argument). — firda, 12 hours ago
I'm not sure why, but it works. I'll do my best not to break it. :D — Francis, 2 hours ago

asked	yesterday
viewed	409 times
active	today

current community

your communities

more stack exchange communities

Lexer+Parser code for my “Reedoo” programming language

2 Answers 2

Parser

Your Answer

Not the answer you're looking for? Browse other questions tagged c++ c++11 parsing or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Lexer+Parser code for my “Reedoo” programming language

2 Answers 2

Did you find this question interesting? Try our newsletter

Parser

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged c++ c++11 parsing or ask your own question.

Related

Hot Network Questions