Sane explanation of compilers: Part 1

Contents

  1. Preface
  2. I. How do we understand code?
    1. Math
      1. Regular expressions
      2. Grammars
    2. Lexers and parsers
      1. Implementational difficulties with grammars
      2. Practical usage
      3. Parsing algorithms
        1. LL
        2. LR
        3. LR’s family
        4. More info
      4. Abstract syntax tree
    3. Implementing a front-end
      1. lex
      2. yacc
      3. Doing it properly
  3. Appendix

Recently I signed up for a small university course about compilers. As I found out, most materials on the topic are either so surface level you barely learn anything, or so specific you need to already know the subject. My goal is to try and fill that void, explaining everything in a detailed way that is understandable for beginners.

Preface

Processors work with numbers; every action (instruction) a processor can do is defined by numbers. However, writing code only with numbers is tedious, error-prone, and simple stuff sometimes takes ages to implement. That is why, over time, programmers have tried to abstract sets of instructions into words and text constructions that are easier to work with.

Thanks to abstraction, programmers have been able to create more and more complicated projects with less and less work, while making their code more portable and compatible with different machines. The trade-off is that we need to translate our nice programmer code to numbers a CPU would understand. This job is delegated to a special type of program, called a compiler1, which is separated into three main layers/components:

  1. Front-end: it understands the program, what the code is trying to do
  2. Middle: does optimizations and simplifications, though this part is often inside the back-end
  3. Back-end: generates the actual machine instructions

In this post, we’ll be looking only at the first one.

I. How do we understand code?

Syntax in current programming languages can be quite sophisticated and ambiguous. For example (taken from Wikipedia), this C++ snippet

void f(double d) {
  int i(int(d));
}

could be interpreted either as a:

  1. declaration of variable i, which is initialized with the value of int(d) (which just casts d to an integer), or
  2. function declaration, since superfluous parentheses around function parameter declarations are allowed (meaning it is treated as int i(int d))

Note that I’m talking about syntax, the valid selection and layouts of characters, and not semantics, the underlying meaning. Compiled programming languages associate specific templates of code (text) with their “corresponding” machine instructions; there isn’t anything special you need to “understand” in a piece of code. In natural languages, you could construct a sentence which follows all grammatical rules, but is absolute nonsense, like “Colorless green ideas sleep furiously”.

The problem of semantics is extremely difficult, however in our case we only care about syntax. And this is where math comes into play.

Math

A word of caution: I’ll use a decent amount of fancy names so that I have a way to more easily refer to different concepts. Don’t be scared; just skip over them if they’re confusing you.

There is a special field in mathematics that deals with syntactical analysis: formal language theory. Although even the basics merit a whole university course, I’ll still try to roughly explain what’s important, for those that haven’t studied it.
This field works primarily with strings via the help of two (relevant for our purposes) instruments: regular expressions and grammars.

Regular expressions

You might’ve heard of, or even already used, regexes in a text editor. They’re small text expressions which specify a text pattern. When a word can be “generated” from the pattern, we say the regex matches the word.

At their very base, such patterns are built with three operations:

  1. concatenation: one pattern written after another, so ab matches only the word “ab”
  2. alternation: a vertical bar between two patterns, so a|b matches either of them
  3. the Kleene star: a star after a pattern, so a* matches zero or more repetitions of it

In the modern day, regular expressions permit additional constructions like matching the start or end of a string, matching character groups or character classes. There are also sometimes features which significantly increase the matching capabilities of normal regexes, like lookahead and lookbehind assertions.
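To make this a bit more concrete, here’s a quick sketch (not part of the front-end we’ll build, just an illustration) of using such a pattern with C++’s standard <regex> library; the pattern acce(n|p)*t is the same one we’ll hand-implement a bit later:

#include <iostream>
#include <regex>

int main() {
    std::regex pattern("acce(n|p)*t");

    std::cout << std::regex_match("accet", pattern) << "\n";    // 1: zero repetitions of (n|p)
    std::cout << std::regex_match("acceppnt", pattern) << "\n"; // 1: any mix of n's and p's
    std::cout << std::regex_match("accident", pattern) << "\n"; // 0: doesn't fit the pattern
}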

Grammars

However, even then regular expressions are fairly weak: they can’t match recursive structures. Let’s say you want to be able to have math expressions in your language, where you can have braces inside braces as much as you like, as long as they’re laid out in a certain manner.

Such a structure is recursive, because it can contain “itself”: 1+2 can be a standalone math expression, but it can also be contained in another expression, like (1+2)*3, which itself could be inside another bigger expression and so on. For this task we have a whole separate tool: grammars.

The syntax is different from regexes: a grammar is divided into (production) rules; each rule is named and specifies an expression. Expressions are a concatenation of strings and/or names2. Names won’t appear in our matched strings; the idea is that a name will be replaced with its expression, kind of like a placeholder. Additionally, there can be multiple rules with the same name, meaning a name can be replaced by different stuff at any point of the string generation.

One name is always chosen to be the starting point, and generation continues from there. In math literature, we generally write rules in the form Name -> Expression3, and often names are a single capital letter. Additionally, the starting name is often called S. Finally, in future examples, I’ll use a double arrow (⇒) to show derivations: how the generated string looks after one or two replacements of a name with one of its expressions.

For an example, with the grammar

S -> aSb
S -> cc

we can generate the string aaaaccbbbb (remember, our starting point is S):

S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaaaSbbbb ⇒ aaaaccbbbb

or accb:

S ⇒ aSb ⇒ accb

where a, b and c are just English characters. To simplify our lives, often rules with the same name are merged into one rule, separated by vertical bars (S -> aSb | cc). This is practically alternation but in a different context.

As for our math expressions, we could create something like this:

N -> 0 | 1 | 2 | ... all possible numbers
E -> N | (E) | E + E | E - E | E * E | E / E | E ^ E

which will successfully be able to match (generate) ((3 * 3) - (2 ^ 3)) + 2:

E ⇒ E + E ⇒ (E) + N ⇒ (E - E) + 2 ⇒ ((E) - (E)) + 2 ⇒ ((E * E) - (E ^ E)) + 2 ⇒ ((N * N) - (N ^ N)) + 2 ⇒ ((3 * 3) - (2 ^ 3)) + 2

but there are no expressions (rules) that allow ()3+3() or )3(-6 to be “matched”.

Lexers and parsers

You might’ve noticed that for N I use three dots and the note “all possible numbers” to signify that N can be any positive whole number. In practice we need to formally define that grammar:

N -> 0 | 1 | 2 | ... | 9 | 0N | 1N | 2N | ... | 9N

With this you might’ve realised that grammars can do anything a regular expression can: concatenation and alternation are “built-in”, and the Kleene star can be done with another rule in the form R -> strR | str4. The reason we use both is that grammars are significantly slower and more complicated in code than regexes.

Implementational difficulties with grammars

Implementing (very roughly) the regular expression acce(n | p)*t is quite easy:

bool matchesp(string str) {
    string regex = "acce_t";

    for (int si = 0, ri = 0; si < str.size() && ri < regex.size(); si++, ri++) {
        while (regex[ri] == '_') {
            if (str[si] == 'n' || str[si] == 'p')
                si++;
            else
                ri++;
        }

        if (str[si] != regex[ri])
            return false;
    }
    return true;
}

In brief, for concatenated characters we do a simple comparison per character. When we reach “_”, which I use to specify we’re in the (n | p)* part of the regex, we do comparisons with the characters n and p as much as we can, after which we continue as normal.

Of course, this version only supports one regular expression, however a “proper” implementation would look conceptually similar: a couple of different loops and equality checks, with maybe an additional array or two. The huge benefit is that loops and arrays are stupid fast, a (simple) C++ for loop could be translated into a handful of processor instructions.

On the other hand, usually grammars are more complicated and resource-intensive. Certain ones are simple enough to implement, like S -> aSb | cc:

bool matchesp(string str) {
    if (str == "cc")
        return true;
    else if (str[0] == 'a' && str[str.size() - 1] == 'b')
        return matchesp(str.substr(1, str.size() - 2));
    else
        return false;
}

However, think about more complicated cases5, like S -> aSbSa | cc. Implementing this one is quite a bit more difficult, since we don’t know where the b is located on the first iteration: it could either be the 4th character accb...a or the 9th aaccbccab...a. Searching for it is also complicated, it could be the first, the last or any other b in the middle of the string.
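To get a feel for why, here’s a rough brute-force sketch (in the same style as the snippets above): it simply tries every b as the splitting point and recurses on both halves, which can get slow quickly because the same substrings may be re-checked over and over:

#include <string>
using namespace std;

bool matchesp(const string& str) {
    if (str == "cc")
        return true;
    // a S b S a needs at least "a" + "cc" + "b" + "cc" + "a" = 7 characters
    if (str.size() < 7 || str.front() != 'a' || str.back() != 'a')
        return false;
    // try every 'b' in the middle as the splitting point
    for (size_t i = 1; i + 1 < str.size() - 1; i++) {
        if (str[i] != 'b')
            continue;
        string left  = str.substr(1, i - 1);                  // between the first 'a' and the 'b'
        string right = str.substr(i + 1, str.size() - i - 2); // between the 'b' and the last 'a'
        if (matchesp(left) && matchesp(right))
            return true;
    }
    return false;
}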

Another difficulty with grammars is ambiguity. Let’s look at

E -> E + E | E * E | n

How would it generate n + n * n? There are two completely valid ways to do so:

E ⇒ E + E ⇒ n + E * E ⇒ n + n * n
E ⇒ E * E ⇒ E + E * n ⇒ n + n * n

How exactly we deal with these problems comes later; for now, the takeaway is that we should try to use regular expressions wherever we can, and leave grammars as a bit of a last resort.

Practical usage

Let’s think a bit more about grammars: they’re really good at handling structure, especially recursive structure, so it’s best if we use them only for that. Our math expression grammar (from here) shouldn’t really care if we’re trying to sum integers or floats or chars6; what’s important is how everything is laid out and that our operations are done with numbers.

This is why we’re going to separate and generalize the code into tokens, basically into building blocks. As a rough example, we want 1 + (2 * 3) to be transformed into NUMBER MATHOP LBRACE NUMBER MATHOP NUMBER RBRACE, int var = 13; into TYPE NAME ASSIGNMENT NUMBER and so on, where “TYPE”, “NUMBER”, “NAME”, … are tokens. Now we’re reducing the amount of “objects” our grammars need to work on: from all characters to many (but significantly fewer) tokens. We’re also simplifying our work: since we’re trying to figure out the overall structure, it doesn’t matter if we’re multiplying or summing numbers, it’s important that we’re doing a mathematical operation on them. Of course, each token will actually be a pair, containing its name (NUMBER, TYPE, …) and its value, but our grammars will work with the names only and output the actual values at the very end. Usually we identify at least six token types:

As you might’ve noticed, here we can use regular expressions: grouping consecutive characters is what regexes are good for. Additionally, we’re going to be reading the file, a decently slow operation, only once and for a comparatively short time.

In the compiler front-end, the part that implements our regular expressions is called the “lexer”, and the part that implements our grammars is called the “parser”. In the last section we already talked about how easy implementing regexes (lexers) is and how difficult implementing grammars (parsers) is, so let’s look at the latter.

Parsing algorithms

Proper algorithms that “implement” grammars are kind of complicated, and most can handle only specific types of grammars. Generally we categorize them into top-down algorithms, which start from the starting name and expand it until it matches the input, and bottom-up algorithms, which start from the input and reduce it back to the starting name.

Furthermore, since at any point there can be multiple names we could replace (meaning a name can be replaced with any one of multiple expressions), we also divide them into those that use left-most derivations, always replacing the left-most name first, and those that use right-most derivations, always replacing the right-most name first.

The advantages and disadvantages of every type kind of depend on the exact algorithm and grammar. Looking at some of the more famous ones:

LL

LL: a top-down, left-most derivations algorithm that can parse only specific types of grammars. In these grammars, every expression is uniquely identifiable by just looking at the characters (tokens) at the very beginning. The number of tokens we need to look at is very important, so we usually write LL(k) for an LL parser that needs to look at k tokens. An “LL(k) grammar” is, by extension, a grammar which can be parsed with an LL(k) parser.

So, the way we’ll check if an input matches our grammar is by starting with the start name and reading the next k characters from the input. We check the first k characters of all expressions, and if any one of them is a name, we try substituting every possibility for that name, essentially peeking into the name’s possibilities.
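As a rough sketch of the idea (not a full LL parser, just the flavor), here’s a recursive descent matcher for the earlier grammar S -> aSb | cc; one character of lookahead is always enough to pick the right expression, so it’s an LL(1) grammar:

#include <string>
using namespace std;

// pos is the current position in the input, advanced as characters are consumed
bool parseS(const string& str, size_t& pos) {
    if (pos < str.size() && str[pos] == 'a') {        // lookahead says: use S -> aSb
        pos++;                                         // consume the 'a'
        return parseS(str, pos)                        // parse the inner S
            && pos < str.size() && str[pos++] == 'b';  // then consume the 'b'
    }
    // otherwise the only possible expression is S -> cc
    if (pos + 1 < str.size() && str[pos] == 'c' && str[pos + 1] == 'c') {
        pos += 2;
        return true;
    }
    return false;
}

bool matchesp(const string& str) {
    size_t pos = 0;
    return parseS(str, pos) && pos == str.size();      // the whole input must be consumed
}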

LR

LR: a bottom-up, right-most derivations algorithm. Conceptually, LR works in the exact opposite way to LL: for every character in the input, we find a rule that substitutes it (or that substitutes the current collection of names and characters) and do so until we’ve reached the starting name.

Figuring out an approach isn’t as intuitive and easy as with LL, but it still isn’t too hard. It all boils down to using a whole bunch of conditional statements and tracing where we “came” from; in practice this is organized into a parsing table, which tells us for each situation whether to read more input (“shift”) or to replace what we’ve collected with a rule’s name (“reduce”).
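For a rough idea of how that looks, here is how a shift/reduce (LR-style) parse of accb against S -> aSb | cc could proceed, reading characters in (“shift”) and replacing a matched rule body with its name (“reduce”):

Stack      Remaining input    Action
(empty)    accb               shift a
a          ccb                shift c
a c        cb                 shift c
a c c      b                  reduce cc -> S
a S        b                  shift b
a S b      (end)              reduce aSb -> S
S          (end)              accept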

LR’s family

LR is more of a family of algorithms, because there are multiple different ways to generate the table, each with its own benefits and drawbacks. This subject is a really big can of worms, which I’m not knowledgeable enough to talk about. What you need to know is that there are 3 often used types: CLR, SLR and LALR.

LALR is the most commonly used LR parser, however most compilers are custom-tailored pieces of software that very loosely follow the methods of LL and LR.

More info

If you want to learn more, I would highly suggest:

Abstract syntax tree

We already decided that the lexer will output tokens (and characters) to the parser, but we didn’t talk about what the parser is supposed to output. It definitely has to be some sort of structural representation of the overall code; for example, for 3 + 1 * 2 we have to be able to easily find out what the left and right hand sides of the + and * are.

Let’s draw out7 this relationship, where the left arrow connects to the left hand side and the right arrow to the right one:

Hey, that looks an awful lot like a tree! A tree is made of nodes, some sort of elements, where each node can connect to some other nodes, with the only rule being that cycles aren’t allowed.

Great, but can we represent everything else with a tree? Why, of course, for example we can represent:

if (a > b)
    c = 5;
else
    d += 10;

as a tree with the if at the root and the condition, the then-branch and the else-branch as its children.

You might ask how we generate the tree, and the answer is simple: on every step we add nodes to the tree, depending on the situation. Let’s look at a familiar example:

N -> 1 | 2 | 3
E -> N | (E) | E + E | E * E

Let’s look at the input string 2 * (1 + 3), while abstracting ourselves from any particular algorithm:

The overall algorithm is to add and change nodes every time you match something with a grammar rule; how exactly the adding and changing is done depends on the language, the current structure and so on.
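As an illustration of what the parser should end up producing (the Node structure here is made up just for this example), here’s the tree for 2 * (1 + 3) built by hand:

#include <string>
using namespace std;

// a minimal node: a value plus up to two children
struct Node {
    string value;
    Node *left = nullptr, *right = nullptr;
};

Node* node(const string& v, Node* l = nullptr, Node* r = nullptr) {
    return new Node{v, l, r};
}

int main() {
    // every time a rule like "E -> E * E" matches, the operator becomes a node
    // and whatever matched on its two sides becomes its children;
    // note that the braces from "(1 + 3)" don't appear in the tree at all,
    // they only influenced its shape
    Node* tree = node("*", node("2"), node("+", node("1"), node("3")));
    (void)tree; // a real front-end would now hand this tree to the next stage
}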

When discussing the middle and back-end layers, we’ll talk about how to work with a generated AST, but for now your clue is the visitor pattern.

Implementing a front-end

If you want, you could implement your own lexer and parser manually, but as a start, it’s worthwhile to try and use already created stuff. Compiler-compilers are tools that create parsers (and more) from their formal definitions.

The two popular ones from the old days are lex and yacc. lex is a lexer generator and yacc is an LALR parser generator, both of which generate a corresponding C implementation of the given regular expressions and grammars.

lex

Lex is the original program, from 1975; in the modern day its successor/implementation is the open-source flex.

A lex source file is divided into three parts:

%{
// C code that will be put in the program verbatim
%}

/* Definitions */

%%

    /* Rules */

%%

// C code

where %{%} is used for includes and declarations, the rules are your regular expressions and the C code is the general logic of the lexer. Usually you wouldn’t write anything in the C code section (the defaults are fine), but you’ll probably need to do one or two includes in the %{%} section.

Rules are in the form regex { C_code }. To simplify your life, you can write only the regex or a part of it inside the “definitions” part of the file, and give it a name to use later. Referencing those names in later regexes is done by surrounding them in curly braces. Usually the C_code in the regex would be a return (more commonly called “emit”) of a token, which will be available in a header file, but more on that later.

Let’s look at a very simple example program: every whole number is printed as “NUMBER”, while everything else is printed as-is:

%{
%}

%%

[0-9]+ { printf("NUMBER"); }
[A-z]+ { printf("%s", yytext); }

%%

The syntax [0-9] is the same as (0 | 1 | ... | 9); it works for letter ranges too and is available in all modern regexes. Note that inside every regex’s “body” you have an assortment of variables already set, one of them being yytext, which is just the whole match. Additionally, lex picks the rule that produces the longest match, so if we also had a rule like [0-z]+, which matches both digits and letters, it would win whenever it can match more characters than the other two.

If this file is located in lexer.l, we can transform it into a C file with:

flex lexer.l

The resulting file is always named lex.yy.c. Afterwards, we can compile the C code into a working lexer:

gcc lex.yy.c -lfl

Without the -lfl flag, we would get a linking error, because of the main and yywrap functions, which are required but aren’t defined in our generated code. There are three ways to fix this problem:

  1. The one I’ve opted for: use the -lfl flag to compile our C code with a (flex) library which provides both functions. Usually the lexer isn’t meant to be run by itself, so for the time being, -lfl is the best option; the other two are fine when we get to yacc.

  2. Put %option noyywrap in the definitions section, which means the lexer will only read one file.

  3. Define it ourselves. It should return 0 if our lexer should continue with the next given file, after finishing with the current one, or 1 if it should stop execution. Since we only have one file, we’ll just return 1. The function could be written in either “C code” section, but I personally prefer the last one.

    int yywrap() { return 1; }
    

Of course, you can use your lex implementation or C compiler of choice, where this might not be a problem. Now, if you run ./a.out you’ll get a prompt; every line you write will be lexically analyzed and, in our case, printed back out. Entering

The number 10 changed to 8, 38, and 19 sequentially, before becoming 0.

will output

The number NUMBER changed to NUMBER, NUMBER, and NUMBER sequentially, before becoming NUMBER.

Now let’s try something more sophisticated: let’s surround numbers and strings in HTML tags, the stuff that powers webpages. We would want numbers to be in the format:

<span class="num">number</span>

and strings:

<span class="str">string</span>

We’ll define numbers as both whole and floating point numbers (with an optional sign), and strings can be surrounded in either single or double quotes.

%{
%}

/* Only multiline comments are supported, not comments that start with // */
/* This is a definition, it's in the format "NAME regex" */
DIGIT [0-9]

%%
    /* Comments in here have to be indented, otherwise you'll get an error */
    /* ? means 0 or 1 of the preceding group*/
    /* + means 1 or more of the preceding group*/
    /* the backward slashes are used because . and + have special meaning in regexes*/
[\+-]?{DIGIT}+(\.{DIGIT}+)?   { printf("<span class=\"num\">%s</span>", yytext); }

    /* [^] is a group of characters, like [], however it specifies "all characters except the ones given"*/
    /* this way if we have "This is a string with " three quotes" it's not going to match the whole thing*/
\"[^"]*\"|'[^']*'             { printf("<span class=\"str\">%s</span>", yytext); }

%%

With the input

John said 10 times "2 + 2 = 5" but '2 + 2 = 4'!

we’ll get

John said <span class="num">10</span> times <span class="str">"2 + 2 = 5"</span> but <span class="str">'2 + 2 = 4'</span>!

You’ve probably already noticed that the numbers inside the strings aren’t matched, because the string rule results in a longer match. In programming languages that don’t have string interpolation this is fine, but for others our regexes would have to be a lot more complicated.

yacc

Similarly to lex, yacc originates from 1975, and its modern replacement is GNU Bison.

A yacc file looks very similar to a lex one:

%{
// C code that will be put in the program verbatim
%}

/* Declarations */

%%

    /* Rules */

%%

// C code

however rather than definitions you have declarations, and the rules are in a completely different format (because they’re grammars).

Token names are defined in declarations, in the format

%token NAME1 NAME2 ...

You can also define precedence with %left and %right. There are also other options you can set, like the start name with %start.
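For example, a declarations section for a small calculator-like language could look roughly like this (the names here are purely illustrative):

%token NUMBER NAME
%left '+' '-'
%left '*' '/'
%start expr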

Grammars are in the form

name : TOKEN1 TOKEN2 ...
     ;

where alternation is defined like

name : TOKEN1
     | TOKEN2
     | TOKEN3
     ;

and it must always end with a ;. On each alternation, you can (should) specify a body, which will contain C code to be executed if this rule is reached:

name : TOKEN1 { C code here }
     | TOKEN2
     | TOKEN3 { C code here }
     ;

Another difference is that, since the parser needs a list of tokens, you cannot run it by itself; you’ll also need to set up a lexer.

Let’s implement a parser for the grammar8:

N -> 1 | 2 | 3
S -> N | (S + S) | (S * S)

First, our tokens: we want to replace all characters with a token. For 1, 2 and 3, let’s use the token name “NUMBER”, for “(” and “)”, LBRACE and RBRACE respectively, and since there is no precedence in this grammar, we don’t actually care if we have (S + S) or (S * S), so we can assign both + and * to the token MATHOP. Meaning, our grammar would behave more like this:

N -> 1 | 2 | 3
M -> + | *
S -> N | (S M S)

First, we need to make our lexer.l; it is going to be very simple:

%{
#include "parser.tab.h"
%}

%%

    /* Remember, [] is a group, [abcd] behaves like "(a | b | c | d)" and "\" escapes characters */
[123]  { return NUMBER; }
[\+\*] { return MATHOP; }
\(     { return LBRACE; }
\)     { return RBRACE; }

%%

Now, token names are defined inside the parser (grammar) file, however our lexer needs to emit (return) them. That’s why, when running bison on our parser.y, we’ll also create a C header file which contains those names already defined, so they’re accessible inside lexer.l. This file is named in the format FILENAME.tab.h, where FILENAME.y is our source yacc file.
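For a rough idea of what’s inside (the exact contents and numeric values differ between yacc/Bison versions), the header declares each token name as a constant, something like:

enum yytokentype {
    NUMBER = 258,
    MATHOP = 259,
    LBRACE = 260,
    RBRACE = 261
};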

Second, we need to make our parser.y:

%{
#include "stdio.h"
int yylex();
int yyerror(char *s);
%}

%token NUMBER MATHOP LBRACE RBRACE

%%

expr : NUMBER
     | LBRACE expr MATHOP expr RBRACE
     ;

%%

int yyerror(char *s) {
    printf("Error: %s\n", s);
    return 0;
}

int main() {
    yyparse();
}

A quirk with GNU Bison is that you need to declare the functions yylex and yyerror beforehand, manually define yyerror and run yyparse in the main function. These are precisely the first 5 and the last 8 lines of code, but for now you don’t need to think about them.

Let’s get to compiling. First, the lexer:

flex lexer.l

then our parser:

bison -d parser.y

where the -d option means we’ll also output the parser.tab.h header file. Finally, we compile both of them together into one executable:

gcc lex.yy.c parser.tab.c -lfl

Now when you run ./a.out, if you enter a valid string, you’ll see a blank line, but if it isn’t valid, you’ll get “Error: syntax error”.

Doing it properly

The way we created that lexer and parser works, but it isn’t very useful and has multiple issues we have to address:

  1. When entering a string that isn’t matched at all, it gets printed back like nothing happened. For example, if we enter the letter a or any string which doesn’t have (, ), 1, 2, 3, + or * characters, it will just get printed back. Ideally, we want to raise an error, so we’re going to add one rule which catches any (non-special) character and throws an error with the built-in yy_fatal_error:

    . { yy_fatal_error("Bad character"); }
    

    This rule must be the last rule, since we don’t want it to be matched, if there is another rule that matches. Remember that lex matches the longest possible string and if there are multiple rules which match the same amount of characters, only the first one is matched.

  2. For every token name returned by the lexer, we also need to have its actual value. We could make a struct which holds both a token name and its value, but there is another “built-in” mechanism for that in yacc/Bison. We would set the global variable yylval to the token value, and it will be changed for every token.

    By default its type is int (in Bison), but usually you would want it to be a union. Let’s make it a union of int and char, though the first part is kind of useless, since we’re only working with single digit numbers.

    %union {
        int num;
        char chr;
    }
    

    Now, inside our lexer, rules would look like:

    RULE1 { yylval.num = 129; return TOKEN; }
    RULE2 { yylval.chr = 'a'; return TOKEN; }
    

    Inside C blocks for grammar rules we can reference the token values with $i, where i is the i’th “token” in the expression. For example, in this rule, $1 would be '(', $3 would be the value of token B and $5 would be the value of the rule name:

    rule : '(' A B ')' rule
         ;
    

    You can set rule “return” values by setting the variable $$:

    rule : LBRACE rule RBRACE { $$ = 10; }
    

    The way you set its type is discussed later. However, $i isn’t of type union, we cannot do something like $i.num; it is the precise value that was set for the token. Therefore, we also need to define which value from the union every token will set, which is done by adding <PROPERTY> before every defined token (where PROPERTY is a property name inside the union):

    %token <num> NUMBER <chr> MATHOP <chr> LBRACE <chr> RBRACE
    
  3. If we enter one valid math expression, we’ll get no error, but when we enter another, we’ll get a “syntax error”9. That is because we try to match only one expression; after it is matched, our parser waits for an end-of-file character. Let’s make it so we try to match an expression on every line:

    start : expr
          | expr start
          ;
    

    A rule like expr : expr expr will produce shift/reduce conflicts. Another option is duplicating our expr rules and adding expr in the beginning or end, but that is cumbersome if we want to make changes.

  4. We’re getting input from the user, but our source text (code) will be in a file. First, our main function needs to take program parameters:

    int main(int argc, char* argv[])
    

    Then we need to take the first argument and open a file. Additionally, we have to check whether the file was successfully opened, and print an error and stop execution if it wasn’t:

    FILE* input = fopen(argv[1], "r");
    if (input == NULL) {
        yyerror("cannot read input file!");
        return 1;
    }
    

    Now for the clever bit: the actual parsing in generated code is done on the file yyin, which by default is the standard user input. To read a file, all we need to do is change the value of that variable. Also, that variable is defined externally, so first we need to add to the (top) C section:

    extern FILE* yyin;
    

    and then put in our main:

    yyin = input;
    

    On *NIX systems (Linux, MacOS, OpenBSD, …) you can still use user input by passing the file /dev/stdin. If you want your front-end to support multiple files, you would just need to wrap the whole body of main in a for loop, iterating over the whole argv array (remember that argc is the length of the array).
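    A rough sketch of what such a multi-file main could look like (this isn’t part of our front-end, just an illustration):

    int main(int argc, char* argv[]) {
        for (int i = 1; i < argc; i++) {
            FILE* input = fopen(argv[i], "r");
            if (input == NULL) {
                yyerror("cannot read input file!");
                return 1;
            }
            yyin = input;
            yyparse();   // parse this file, then move on to the next one
            fclose(input);
        }
        return 0;
    }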

  5. We need to construct an AST. Since the exact tree structure depends on your needs, I’ll make the simplest possible tree in pure C that works for this grammar. I won’t explain this in much detail, since it is just normal C code.

    #include <stdlib.h>
    #include <stdio.h>
    
    typedef struct node {
        char val;
        struct node *left;
        struct node *right;
    } node;
    
    node* newNode(char val, node *left, node *right) {
        node* out = malloc(sizeof(node));
        out->val = val;
        out->left = left;
        out->right = right;
        return out;
    }
    
    void printTree(node *n, int level) {
        if (n == NULL) return;
        // %*s will add level amounts of spaces (but with the "" string)
        printf("%*s|-%c\n", level, "", n->val);
        printTree(n->left, level + 1);
        printTree(n->right, level + 1);
    }
    
    void freeNode(node *n) {
        if (n == NULL) return;
        freeNode(n->left);
        freeNode(n->right);
        free(n);
    }
    

    However, all of this is quite a lot, and it would take a lot of space in parser.y, so let’s separate it into tree.c and add a header file tree.h which contains the definitions:

    #include <stdlib.h>
    #include <stdio.h>
    
    typedef struct node {
        char val;
        struct node *left;
        struct node *right;
    } node;
    
    node* newNode(char val, node *left, node *right);
    void printTree(node *n, int level);
    void freeNode(node *n);
    

    Of course, we should add some #ifndef/#define include guards, but that isn’t relevant for now.

    The way we’re going to use trees in our grammar is as follows: for the NUMBER expression of expr, we’ll create a node with the (char) value of the number and “return” it. For the bracketed expression of expr, we’ll “return” a node with the value of the math operator, whose left and right children are the “returned” values of the two inner exprs. Inside start, after we’ve parsed an expr, we’ll print out the “returned” tree and free its memory.

    I’m putting “returned” in quotes, because (as mentioned) we’ll just be setting the value of $$. $$’s type is set with the %type option, which works exactly like %token, but with grammar rule names. However, this means that node must be available in %union, which is unfortunate but isn’t a problem; we’ll just add it.

    Now, finally, it’s time to include our tree.h in parser.y, but we’ll also need to include it in lexer.l. The reason is that the %union appears in parser.tab.h, which means our node structure is referenced in parser.tab.h, but tree.h isn’t included there.
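    Putting the pieces of this item together, the rules in parser.y could end up looking roughly like this (a sketch, assuming the lexer is updated to set yylval.num to the numeric value for NUMBER and yylval.chr to the character for MATHOP, and that we add a node pointer called tree to the union):

    %union {
        int num;
        char chr;
        struct node *tree;
    }

    %token <num> NUMBER <chr> MATHOP <chr> LBRACE <chr> RBRACE
    %type <tree> expr

    %%

    start : expr                          { printTree($1, 0); freeNode($1); }
          | expr start                    { printTree($1, 0); freeNode($1); }
          ;

    expr : NUMBER                         { $$ = newNode('0' + $1, NULL, NULL); } /* turn the number back into a char for the tree */
         | LBRACE expr MATHOP expr RBRACE { $$ = newNode($3, $2, $4); }
         ;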

All in all, we’ll have 4 source files for our simple front-end: lexer.l, parser.y, tree.c and tree.h.

We compile everything:

flex lexer.l && bison -d parser.y && gcc lex.yy.c parser.tab.c tree.c -lfl

Now, let’s create the file test.txt:

((1+1) * ((3+3) * 2))

and run our front-end on it:

./a.out test.txt

Our output will be:

|-*
 |-+
  |-1
  |-1
 |-*
  |-+
   |-3
   |-3
  |-2

Perfect, we have a good looking tree!

Appendix

Originally I planned to put everything in one blog post, but writing the front-end (this part) alone took me 2 and a half weeks of (inconsistent, I was suddenly hit with multiple exams and homeworks in uni) writing. Additionally, while doing research, I found out the course actually skipped over a lot of important topics, which I’ll first need to research. Future parts will be linked right below the title at the very top, but it will probably be quite a while until I write them.

Many thanks to Dimitar Tagarev for his help with reviewing this post.


  1. Though it’s obvious, I would like to point out that modern compilers are most often written in languages that are themselves compiled. You might ask yourself “where did the first compiler come from?” The answer is simple: it was written with “numbers”, with direct instructions to the processor, from the time when computers were simple enough that all programming was done with CPU instructions. 

  2. Formally, we call the rule “names” nonterminals, the idea being that all normal characters that could appear in our string are called terminals, since they would terminate replacement and string generation. For the purposes of this post, I won’t use the formal names, with the hope that using less terminology will increase understandability.

  3. This is pretty much the mathematical notation; often in CS the Backus-Naur form is used, but I find that a bit harder to read. You might also make the argument that I should use a proper unicode arrow symbol, like →, but -> is way more convenient.

  4. Such grammars are called regular grammars; more formally, every rule has to contain at most one name, and that name must be at the start or the end of the expression.

  5. Be careful about cases that are too complicated, the grammar I’m showing is “context-free”, but that topic is way beyond the scope of this post. 

  6. Yeah yeah, in a statically typed language you might not be able to do operations on doubles and integers at the same time, but you get the point. 

  7. The graphics are made with nomnoml by Daniel Kallin 

  8. You might be tempted to try and make one for the grammar inside the AST section, but that one is ever so slightly problematic, because it’s ambiguous (I showed why at the end of Implementational difficulties with grammars). Bison can handle ambiguity by always choosing to shift, but I didn’t want to complicate things. 

  9. You can add %define parse.error verbose inside the declarations section of your yacc file, and it will tell you what token was expected and what token was read.