Sane explanation of compilers: Part 1
Recently I signed up for a small university course about compilers. As I found out, most materials on the topic are either so surface-level that you barely learn anything, or so specific that you need to already know the subject. My goal is to try and fill that void, explaining everything in a detailed way that is understandable for beginners.
Preface
Processors work with numbers: every action (instruction) a processor can do is defined by numbers. However, writing code only with numbers is tedious and error-prone, and simple tasks sometimes take ages to implement. That is why, over time, programmers have tried to abstract sets of instructions into words and text constructions that are easier to work with.
Thanks to abstraction, programmers have been able to create more and more complicated projects with less and less work, projects that are more portable and compatible with different machines. The trade-off is that we need to translate our nice programmer code into numbers a CPU would understand. This job is delegated to a special type of program, called a compiler1, which is separated into three main layers/components:
- Front-end: it understands the program, what the code is trying to do
- Middle: does optimizations and simplifications, though this part is often inside the back-end
- Back-end: generates the actual machine instructions
In this post, we’ll be looking only at the first one.
I. How do we understand code?
Syntax in current programming languages can be quite sophisticated and ambiguous. For example (taken from Wikipedia), this C++ snippet
```
void f(double d) {
    int i(int(d));
}
```
could be interpreted either as a:
- declaration of variable `i`, which is initialized with the value of `int(d)` (which just casts `d` to an integer), or
- function declaration, since superfluous parentheses around function parameter declarations are allowed (meaning it is treated as `int i(int d)`)
Note that I'm talking about syntax (the valid selection and layout of characters) and not semantics (the underlying meaning). Compiled programming languages associate specific templates of code (text) with their "corresponding" machine instructions; there isn't anything deeper you need to understand in a piece of code. In natural languages, by contrast, you could construct a sentence which follows all grammatical rules but is absolute nonsense, like "Colorless green ideas sleep furiously".
The problem of semantics is extremely difficult, however in our case we only care about syntax. And this is where math comes into play.
Math
A word of caution: I'll use a decent amount of fancy names, so that I have a way to refer to different concepts more easily. Don't be scared, and skip over them if they confuse you.
There is a special field in mathematics that deals with syntactical analysis: formal language theory.
Although even the basics merit a whole university course, I’ll still try to roughly explain what’s important, for those that haven’t studied it.
This field works primarily with strings via the help of two (relevant for our purposes) instruments: regular expressions and grammars.
Regular expressions
You might've heard of or even already used regexes in a text editor. They're small text expressions which specify a text pattern. When a word can be "generated" from the pattern, we say the regex matches the word.
At their very base, such a pattern is created with the following rules:
- concatenation - you can put characters next to each other to form strings, so a normal string like `accent` is also a regex. With parentheses we can group stuff; for example, in `a(cc)ent` the two "c"s are grouped together. This is used to form more sophisticated structures later on, but currently it doesn't alter anything, just like the parentheses in `(1+2)+3` and `1+(2+3)` don't alter the resulting sum.
- alternation - allow different (alternative) strings in the same spot. We do so with vertical bars (which signify an "or" relationship); the group can be "replaced" with either value. For example, `acce(n | p)t` matches both "accent" and "accept". There is no limit on how many alternatives you can have and what their lengths are: `acc(ordingly | ent | urate)` matches "accordingly", "accent" and "accurate". There is also no limit on how many groups (with alternation) you can have: `ac(quirem | c | hievem)en(t | ts)` matches "acquirement", "acquirements", "accent", "accents", "achievement" and "achievements".
- Kleene star - repeat the preceding letter or group, 0 or more times. An asterisk is used to mark this operation (hence the name, "Kleene star"). `accents*` will match
  - "accent"
  - "accents"
  - "accentss"
  - "accentsss"
  - and so on

  Of course, the asterisk could also be in the middle of the regex, like `ac*ent`, which matches
  - "acent"
  - "accent"
  - "acccent"
  - …

  With a group we can repeat entire strings: `acc(ent)*s` matches
  - "accs"
  - "accents"
  - "accentents"
  - "accententents"
  - …

  An alternation allows any alternative to be chosen at any point, so `acce(n | p)*t` matches
  - "accet"
  - "accent"
  - "accept"
  - "accennt"
  - "accenpt"
  - "accepnt"
  - "acceppt"
  - "accennnt"
  - …
In the modern day, regular expressions permit additional constructions, like matching the start or end of a string, or matching character groups and character classes. Some flavors also have features which significantly increase the matching capabilities of normal regexes, like lookahead and lookbehind assertions.
Grammars
However, even then, regular expressions are fairly weak: they can't match recursive structures. Let's say you want to be able to have math expressions in your language; you can have parentheses inside parentheses as much as you like, but in a certain manner:
- `1+2` is a valid expression
- `(9-8)+2` also is valid
- `((3*3)-(2^3))+2` is too
- however `()3*3)-(2^3()+2` isn't valid
Such a structure is recursive, because it can contain "itself": `1+2` can be a standalone math expression, but it can also be contained in another expression, like `(1+2)*3`, which itself could be inside another, bigger expression, and so on.
For this task we have a whole separate tool: grammars.
The syntax is different from regexes: a grammar is divided into (production) rules; each rule is named and specifies an expression. Expressions are a concatenation of strings and/or names2. Names won't appear in our matched strings; the idea is that a name will be replaced with its expression, kind of like a placeholder. Additionally, there can be multiple rules with the same name, meaning a name can be replaced by different expressions at any point of the string generation. One name is always chosen to be the starting point, and generation continues from there.
In math literature, we generally write rules in the form `Name -> Expression`3, and often names are a single capital letter. Additionally, the starting name is often called `S`. Finally, in future examples, I'll use the arrow `⇒` to show derivations: how the generated string looks after one or more replacements of a name with one of its expressions.
For example, with the grammar

```
S -> aSb
S -> cc
```

we can generate the string `aaaaccbbbb` (remember, our starting point is `S`):

```
S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaaaSbbbb ⇒ aaaaccbbbb
```

or `accb`:

```
S ⇒ aSb ⇒ accb
```

where a, b and c are just English characters.
To simplify our lives, rules with the same name are often merged into one rule, separated by vertical bars (`S -> aSb | cc`). This is practically alternation, but in a different context.
As for our math expressions, we could create something like this:

```
N -> 0 | 1 | 2 | ... all possible numbers
E -> N | (E) | E + E | E - E | E * E | E / E | E ^ E
```

which will successfully be able to match (generate) `((3 * 3) - (2 ^ 3)) + 2`:

```
E ⇒ E + E ⇒ (E) + N ⇒ (E - E) + 2 ⇒ ((E) - (E)) + 2 ⇒ ((E * E) - (E ^ E)) + 2 ⇒ ((N * N) - (N ^ N)) + 2 ⇒ ((3 * 3) - (2 ^ 3)) + 2
```

but there are no expressions (rules) that allow `()3+3()` or `)3(-6` to be "matched".
Lexers and parsers
You might've noticed that for `N` I use three dots and the note "all possible numbers" to signify that N can be any positive whole number. In practice we need to formally define that grammar:

```
N -> 0 | 1 | 2 | ... | 9 | 0N | 1N | 2N | ... | 9N
```
With this you might've realised that grammars can do anything regular expressions can: concatenation and alternation are "built-in", and the Kleene star can be done with another rule in the form `R -> strR | str`4.
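For example, here's one way (my own sketch) to encode the earlier regex `acce(n | p)*t` as a grammar, with a helper rule `R` handling the repetition:

```
R -> nR | pR | t
S -> acceR
```

A derivation like S ⇒ acceR ⇒ accenR ⇒ accenpR ⇒ accenpt shows the star in action.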
The reason we use both is that grammars are significantly slower and more complicated in code than regexes.
Implementational difficulties with grammars
Implementing (very roughly) the regular expression `acce(n | p)*t` is quite easy:
```
#include <string>
using std::string;

bool matchesp(string str) {
    string regex = "acce_t";
    for (int si = 0, ri = 0; si < str.size() && ri < regex.size(); si++, ri++) {
        // '_' marks the (n | p)* part: consume 'n's and 'p's, then move past '_'
        while (regex[ri] == '_')
            if (str[si] == 'n' || str[si] == 'p')
                si++;
            else
                ri++;
        if (str[si] != regex[ri])
            return false;
    }
    return true;
}
```
In brief, for concatenated characters we do a simple comparison per character. When we reach "_", which I use to specify that we're in the `(n | p)*` part of the regex, we do comparisons with the characters n and p for as long as we can, after which we continue as normal.
Of course, this version only supports one regular expression; however, a "proper" implementation would look conceptually similar: a couple of different loops and equality checks, with maybe an additional array or two. The huge benefit is that loops and arrays are stupid fast; a (simple) C++ for loop could be translated into a handful of processor instructions.
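To give a taste of what such a "proper" implementation could look like: tools like lex (which we'll meet later) compile each regex into a table of states, a finite automaton. Here's a hand-made sketch of the same `acce(n | p)*t` pattern as a state machine (the state numbering is my own):

```
#include <string>

// Each state remembers how much of "acce(n | p)*t" we've matched so far;
// -1 means the current character can't continue a match.
int next(int state, char c) {
    switch (state) {
    case 0: return c == 'a' ? 1 : -1;
    case 1: return c == 'c' ? 2 : -1;
    case 2: return c == 'c' ? 3 : -1;
    case 3: return c == 'e' ? 4 : -1;
    case 4: // the (n | p)* loop: stay here on n/p, finish on t
        if (c == 'n' || c == 'p') return 4;
        return c == 't' ? 5 : -1;
    default: return -1;
    }
}

bool matchesDFA(const std::string& str) {
    int state = 0;
    for (char c : str) {
        state = next(state, c);
        if (state == -1)
            return false;
    }
    return state == 5; // state 5 means we've read a full match
}
```

The whole matcher is again just a loop over a table of comparisons, which is exactly what makes regexes so fast.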
On the other hand, usually grammars are more complicated and resource-intensive.
Certain ones are simple enough to implement, like `S -> aSb | cc`:
```
bool matchesp(string str) {
    if (str == "cc")
        return true;
    // peel off the outer a...b and check the middle part
    else if (str.size() >= 2 && str[0] == 'a' && str[str.size() - 1] == 'b')
        return matchesp(str.substr(1, str.size() - 2));
    else
        return false;
}
```
However, think about more complicated cases5, like `S -> aSbSa | cc`. Implementing this one is quite a bit more difficult, since we don't know where `b` is located on the first iteration: it could be the 4th character (`accb...a`) or the 9th (`aaccbccab...a`). Searching for it is also complicated: it could be the first, the last or any other `b` in the middle of the string.
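One rough way to handle it (my own sketch, not an established algorithm) is to simply try every `b` as a potential split point and recurse on both halves; note how this can re-check the same substrings over and over:

```
#include <string>
using std::string;

bool matchesp(string str) {
    if (str == "cc")
        return true;
    // S -> aSbSa: the shortest such string is "accbcca" (7 characters)
    if (str.size() < 7 || str.front() != 'a' || str.back() != 'a')
        return false;
    // try every 'b' between the outer a's as the split point
    for (size_t i = 2; i + 2 < str.size(); i++)
        if (str[i] == 'b'
                && matchesp(str.substr(1, i - 1))                   // the S before b
                && matchesp(str.substr(i + 1, str.size() - i - 2))) // the S after b
            return true;
    return false;
}
```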
Another difficulty with grammars is ambiguity. Let's look at

```
E -> E + E | E * E | n
```

How would it generate `n + n * n`? There are two completely valid ways to do so:

```
E ⇒ E + E ⇒ n + E * E ⇒ n + n * n
E ⇒ E * E ⇒ E + E * n ⇒ n + n * n
```
How exactly we deal with these problems comes later; for now the takeaway is that we should try to use regular expressions wherever we can, and leave grammars as a bit of a last resort.
Practical usage
Let's think a bit more about grammars: they're really good at handling structure, especially recursive structure, so it's best if we use them only for that. Our math expression grammar (from earlier) shouldn't really care if we're trying to sum integers or floats or chars6; what's important is how everything is laid out, and that our operations are done with numbers.
This is why we’re going to separate and generalize the code into tokens, basically into building blocks.
As a rough example, we want `1 + (2 * 3)` to be transformed into `NUMBER MATHOP LBRACE NUMBER MATHOP NUMBER RBRACE`, `int var = 13;` into `TYPE NAME ASSIGNMENT NUMBER`, and so on, where "TYPE", "NUMBER", "NAME", … are tokens.
Now we're reducing the number of "objects" our grammars need to work on: from all characters to many (but significantly fewer) tokens. We're also simplifying our work: since we're trying to figure out the overall structure, it doesn't matter if we're multiplying or summing numbers; what's important is that we're doing a mathematical operation on them.
Of course, each token will actually be a pair, containing its name (NUMBER, TYPE, …) and its value, but our grammars will work with the names only and output the actual values at the very end.
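As a sketch (the exact names and layout here are my own invention), such a pair could be as simple as:

```
#include <string>

// What the lexer hands to the parser: the parser's grammar rules look only
// at the type; the value is carried along and output at the very end.
enum class TokenType { NUMBER, MATHOP, LBRACE, RBRACE, TYPE, NAME, ASSIGNMENT };

struct Token {
    TokenType type;    // e.g. NUMBER
    std::string value; // e.g. "13" or "+"
};
```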
Usually we identify at least six token types:
- identifiers: variable and function names, for example
- keywords: `if`, `while`, `for`, `return`, …
- separators: braces, `;`
- operators: `+`, `-`, `<=`
- literals: numbers, strings, characters
- comments
As you might've noticed, here we can use regular expressions: grouping consecutive characters is exactly what regexes are good for. Additionally, we're going to be reading the file (a decently slow operation) only once, and spending comparatively little time on it.
In the compiler front-end, the part that implements our regular expressions is called the "lexer", and the part that implements our grammars, the "parser". In the last section we already talked about how easy implementing regexes (lexers) is and how difficult implementing grammars (parsers) is, so let's look at the latter.
Parsing algorithms
Proper algorithms that "implement" grammars are kind of complicated, and most can handle only specific types of grammars. Generally we categorize them into:
- top-down algorithms, which work in the way we've been talking about grammars: start with a name and replace it with its expression, and continue replacements until we have no more names
- bottom-up algorithms, where we do the opposite: for each substring we find which rule it could've come from, replace it with the corresponding name, and continue upwards like that, until we reach the starting name
Furthermore, if our algorithms allow for ambiguity (meaning we can replace a name with any one of multiple expressions), we divide them into:
- left-most derivation algorithms, where we choose the left-most valid expression
- right-most derivation algorithms, where we choose the right-most valid expression
The advantages and disadvantages of every type kind of depend on the exact algorithm and grammar. Let's look at some of the more famous ones:
LL
LL is a top-down, left-most derivations algorithm that can parse only specific types of grammars. In these grammars, every expression is uniquely identifiable by just looking at the characters (tokens) at the very beginning. The number of tokens we need to look at is very important, so we usually write LL(k) for an LL parser that needs to look at k tokens. An "LL(k) grammar" is, by extension, a grammar which can be parsed with an LL(k) parser.
So, the way we'll check if an expression matches our grammar is by starting with the start name and reading the next k characters from the input. We check the first k characters of all expressions, and if any one of them is a name, we try substituting every possibility for that name, essentially peeking into the name's possibilities.
As an example, the grammar:
```
F -> a
S -> F | (S + F)
```
is LL(1). Let's read the input `(a + a)` (the "v" points to the symbol we're looking/peeking at):

```
            v
Input:      (a + a)
            v
Production: S
```

We start with `S`. We peek and get the `(`; in our production we're looking at a name, S. Let's check the first character of every expression of S. In the first expression, the first "character" is `F`, a name, so we look at every expression for `F`. There is only one, which is "a", different from "(", no dice. Now we check the second expression for S: its first character is "(", success! We replace:

```
            v
Input:      (a + a)
            v
Production: (S + F)
```

Now the first symbol in our production is also `(`, so we increment the symbols we're looking/peeking at:

```
             v
Input:      (a + a)
             v
Production: (S + F)
```

We peek and get the `a`; in our production the second "character" is the name `S`, so `S` has to become `a`. We do the same check from before: the first expression for `S` is `F`, whose first (and only) expression is `a`, success!

```
             v
Input:      (a + a)
             v
Production: (a + F)
```

The middle characters " + " are matched, and at the second `F` we check its expressions; it has only `a`: success, replace, end of input, we're done. Another LL(1) example can be found in Karl R. Abrahamson's notes.
The grammar

```
B -> b
C -> c
S -> aaB | aaC
```

is LL(3). For the input `aab`, if we look at the first two "a"s, we still don't know which rule to use: `aaB` or `aaC`. However, if we look at the third symbol, "b", then we can determine that we need `aaB`.
Usually, parsers that implement this procedure are called recursive descent parsers.
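To make that concrete, here's a minimal recursive descent parser (my own sketch) for the first LL(1) grammar above: each name gets its own function, and a single character of lookahead decides which expression to use:

```
#include <string>

struct Parser {
    std::string input;
    size_t pos = 0;

    bool peekIs(char c) { return pos < input.size() && input[pos] == c; }

    bool eat(char c) {
        if (!peekIs(c)) return false;
        pos++;
        return true;
    }

    // F -> a
    bool parseF() { return eat('a'); }

    // S -> F | (S + F)
    bool parseS() {
        if (peekIs('(')) // peeking "(" tells us to pick S -> (S + F)
            return eat('(') && parseS() && eat(' ') && eat('+')
                && eat(' ') && parseF() && eat(')');
        return parseF(); // otherwise it must be S -> F
    }

    bool matches() { return parseS() && pos == input.size(); }
};
```

With it, calling `matches()` on the input "(a + a)" walks through exactly the peek-and-replace steps we did by hand.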
LR
LR is a bottom-up, right-most derivations algorithm. Conceptually, LR works in the exact opposite way to LL: for every character in the input, we find a rule that substitutes it (or that substitutes the current collection of names and characters), and do so until we've reached the starting name.
Figuring out an approach isn’t as intuitive and easy as with LL, but it still isn’t too hard. It all boils down to using a whole bunch of conditional statements and tracing where we “came” from.
Let's look at the first LL example:

```
F -> a
S -> F | (S + F)
```

We need to track the current state of the string. Imagine we've just read an "a": it could be from either the left or right hand side of "a + a", and that changes what character to expect next. We're going to track the state with different numbers, defined like this:
| State (number) | Current expression ends with |
|---|---|
| 0 | Nothing; this is the start |
| 1 | The name `S`; the final number should be this |
| 2 | The name `F` |
| 3 | The character "(" |
| 4 | The character "a" |
| 5 | "(S" |
| 6 | "(S + " |
| 7 | "(S + F" |
| 8 | "(S + F)"; this is the one before the final |

You can think of every state number as a valid but "unfinished" state of the current string. For now, don't think about the way this table is defined; it'll make sense later. But you could make a state for literally every valid string composition: "(", "(a", "(F", "(S", ….
Keep in mind that the way I'll show the algorithm step by step isn't really correct, but it will suffice for the current moment. Reading a character from the input means appending it to the right of String:
```
String : (
State  : 0
```

We've read "(", so our string "ends" on a "(".

```
String : (a
State  : 3
```

We've read "a", our string "ends" on an "a".

```
String : (a
State  : 4
```

Here comes the clever bit: for certain states, we'll stop reading forward and do something else. If we're on state 4, our string ends on an "a", so we're going to go backwards and replace it with an F, meaning the current state should be 2.

```
String : (F
State  : 2
```

Here is another such rule: if our state is 2, our string ends with an "F" and we replace it with an S; now the state is 5. That's why we have a different state for "(S + F": if we didn't, our state would become 2 and we would make "(S + S", which isn't valid. So, think of 2 as "this is the first F".

```
String : (S
State  : 5
```

We have a different state for "(S", because if we didn't, we would've gone to state 1, and could've ended the algorithm then and there, without reading the rest of the input. Otherwise, we continue as normal: we read the `+` and the spaces around it, going to state 6.

```
String : (S + a
State  : 6
```

Now our string ends with an "a". It's tempting to change our state to 4, but we shouldn't: that would throw away our progress so far! Instead, this state will also be special: if we're at 6 and we end with "a", rather than going to 4, we'll substitute "a" with F now and change state to 7.

```
String : (S + F
State  : 7
```

And now it's trivial: we read ")", so we go to state 8. But if we're at state 8, we don't continue reading; we replace with state 1. Now that we're at state 1, if we read any character (including end of string), we'll say the string up to now matches and we're done!
It's also helpful to put what we do in different situations in another table:

| State | If ends on | Change state to | Also do |
|---|---|---|---|
| 0 | "(" | 3 | |
| 1 | | | Nothing, we matched |
| 2 | | 5 | Replace "F" with "S" |
| 3 | "a" | 4 | |
| 4 | | 2 | Replace "a" with "F" |
| 5 | " + " | 6 | |
| 6 | "a" | 7 | Replace "a" with "F" |
| 7 | ")" | 8 | |
| 8 | | 1 | Replace "(S + F)" with "S" |

where, if there is a value in "If ends on", we read the next character, while otherwise we don't, and we just execute "Change state to" and "Also do".
You're probably wondering why the table works so weirdly, like why on state 3 we don't directly replace "a" with "S" and change to state 5, for example. The reason is that the way I showed the algorithm isn't exactly how it normally works: rather than tracking the current state in a variable, we track it with a stack.
When you read a character and have to go to another state, the state is added to the stack (this operation is called "shift"). When we have to replace characters in String (without reading anything) using a string (expression) of length j, we remove the last j added states and do a normal shift from the last state and character in String (this operation is called "reduce").
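In code, every LR parser is the same small loop around that stack; only the tables differ. Here's a generic sketch (my own, with the table lookups passed in as functions, since building the actual tables is the part done by a generator):

```
#include <functional>
#include <string>
#include <vector>

enum ActionKind { SHIFT, REDUCE, ACCEPT, ERROR };
struct Action {
    ActionKind kind;
    int state;      // for SHIFT: the state to push
    int ruleLength; // for REDUCE: how many symbols the rule replaces
    char ruleName;  // for REDUCE: the name we reduce to, e.g. 'S'
};

bool parse(const std::string& input,
           std::function<Action(int, char)> actionFor,
           std::function<int(int, char)> gotoFor) {
    std::vector<int> stack = {0}; // the start state
    size_t pos = 0;
    while (true) {
        char lookahead = pos < input.size() ? input[pos] : '$';
        Action act = actionFor(stack.back(), lookahead);
        switch (act.kind) {
        case SHIFT: // read a character and remember where we came from
            stack.push_back(act.state);
            pos++;
            break;
        case REDUCE: // pop one state per replaced symbol...
            for (int j = 0; j < act.ruleLength; j++)
                stack.pop_back();
            // ...then shift the rule's name from the uncovered state
            stack.push_back(gotoFor(stack.back(), act.ruleName));
            break;
        case ACCEPT:
            return true; // everything reduced to the starting name
        default:
            return false; // no table entry: syntax error
        }
    }
}
```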
Let's look at the same example, done with shift and reduce operations (without me showing the proper table), where pushing and popping items from the stack is done from the right side of the array:

```
String : (
State  : [0]
```

We've read "(", so we add 3 on top of the stack, aka shift(3). Then we read the "a", shift(4).

```
String : (a
State  : [0, 3, 4]
```

With 4 we'll do a reduction: "a" gets replaced by F, and 4 is removed:

```
String : (F
State  : [0, 3]
```

If the last state is 3 and the last character is "F", we do shift(2).

```
String : (F
State  : [0, 3, 2]
```

Now we'll do a reduction, where F is replaced by S:

```
String : (S
State  : [0, 3]
```

If the last state is 3 and the last character is "S", we do shift(5).

And so on, we continue. As you can see, we go back and forth: from 3 to 4, back to 3, then to 2, back to 3, then to 5. This way we avoid doing different things in different states, and we don't need to store the whole string, only the last few characters. Additionally, in the real world nobody makes this table by hand; it is generated by a program, and this method makes table generation easier.
As for the complete table, you can view it on the Context Free Grammar Checker, or with prettier styling on Grammophone, opening "Parsing table" on the row "LR(0)". To read the table, remember that:
- on every row you have each state
- on every column you have each possible value in the string, plus "$", which signifies the end of the string
- with `s` or `shift` (or just with a number) you go to another state
- with `r` or `reduce` you make the substitution in the current string and go back one state
- `acc` or `accept` means you've reached the end and the string is valid
LR’s family
LR is more of a family of algorithms, because there are multiple different ways to generate the table, each with its own benefits and drawbacks. This subject is a really big can of worms, which I'm not knowledgeable enough to talk about. What you need to know is that there are 3 often-used types: CLR, SLR and LALR.
- CLR is an LR parser with 1 character lookahead (LR(1)), following Knuth's original theory. It requires a lot of memory for table generation, so Frank DeRemer created two simplifications:
- SLR and LALR, which use significantly less memory but work with somewhat fewer grammars.

LALR is the most commonly used LR parser; however, most compilers are custom-tailored pieces of software that only very loosely follow the methods of LL and LR.
More info
If you want to learn more, I would highly suggest:
- LL and LR Parsing Demystified and LL and LR in Context: Why Parsing Tools Are Hard by Josh Haberman
- Parsing: Top-down versus bottom-up by Jeffrey Kegler
- Which Parsing Approach? by Laurence Tratt
- A Guide to Parsing: Algorithms and Terminology by Gabriele Tomassetti
- Parsing, Top-Down Parsers and Bottom-Up Parsers from Karl R. Abrahamson’s notes
- How to Implement an LR(1) Parser by Kirill Andreev
Abstract syntax tree
We already decided that the lexer will output tokens (and characters) to the parser, but we didn't talk about what the parser is supposed to output. It definitely has to be some sort of structural representation of the overall code: for example, for `3 + 1 * 2` we have to be able to easily find out what the left and right hand sides of the `+` and `*` are.
Let's draw out7 this relationship, where the left arrow connects to the left hand side and the right one accordingly:

Hey, that looks an awful lot like a tree! Trees are nodes (some sort of elements), where there is a connection from one element to some other elements, with the only rule being that cycles aren't allowed:

Great, but can we represent everything else with a tree? Why, of course. For example, we can represent:
```
if (a > b)
    c = 5;
else
    d += 10;
```
as
You might ask how we generate the tree, and the answer is simple: on every step we add nodes to the tree, depending on the situation. Let’s look at a familiar example:
```
N -> 1 | 2 | 3
E -> N | (E) | E + E | E * E
```
Let's look at the input string `2 * (1 + 3)`, while abstracting ourselves from any particular algorithm:
- we read the `2`; that is a number, so we add it as a node
- then we read the `*`; we now know our expression is in the form `E * E`. If you think about it, up to this point we've basically been building a tree for the left hand side.
  - If the topmost element in that tree is a parenthesis, we connect it to the left hand side of `*`.
  - Otherwise (when it's `+` or `*` or a number), because of precedence, we'll attach the latest number to the left hand side. However, we have to be careful: if something connects to that number, we now need it to connect to the `*` instead. For example, let's say we had `1 + 2 * 3`, and we've just read the `*`. In the tree up to this point, the latest number we've found was `2`, but the `+` connects to it. So, we need to make the `2` connect to the left hand side of `*`, and make `+` connect to `*` instead of `2`.
  - In our original case, nothing connects to the `2` in `2 * (1 + 3)`, so we can just connect it to the left hand side of `*`, where the three dots signify there isn't anything there (but we expect there to be). Somewhere we'll store that the next element we find has to be connected to the right hand side.
- now we read `(`; we cannot have it by itself, so we directly decide we're working with `(E)` (in our full algorithm we would need a check for the existence of a matching closing parenthesis). We'll create a `()` node and, because of the last step, connect it to the right hand side. Now, moving forward, we'll be creating a subtree, essentially making a second tree, and after that is done, we'll connect its topmost element (root) to the `()` node.
- and so on.
The overall algorithm is to add and change nodes every time you match something with a grammar rule; how exactly the adding and changing is done depends on the language, the current structure and so on.
When discussing the middle and back-end layers, we'll talk about how to work with a generated AST, but for now your clue is the visitor pattern.
Implementing a front-end
If you want, you could implement your own lexer and parser manually, but as a start, it's worthwhile to try using already-created stuff. Compiler-compilers are tools that create parsers (and more) from their normal formal definition.
The two popular ones from the old days are lex and yacc: lex is a lexer generator and yacc is an LALR parser generator, both of which generate a corresponding C implementation of the given regular expressions and grammars.
lex
Lex is the original program, from 1975; in the modern day its successor/implementation is the open-source flex.
A lex source file is divided into three parts:
```
%{
// C code that will be put in the program verbatim
%}
/* Definitions */
%%
/* Rules */
%%
// C code
```
where the `%{%}` section is used for includes and declarations, the rules are your regular expressions, and the C code is the general logic of the lexer. Usually you wouldn't write anything in the C code section (the defaults are fine), but you'll probably need to do one or two includes in the `%{%}` section.
Rules are in the form `regex { C_code }`. To simplify your life, you can write only the regex, or a part of it, inside the "definitions" part of the file and give it a name to use later. Referencing those names in later regexes is done by surrounding them with curly braces. Usually the `C_code` in a rule would be a return (more commonly called "emit") of a token, which will be available in a header file, but more on that later.
Let’s look at a very simple example program: every whole number is printed as “NUMBER”, while everything else is printed as-is:
```
%{
%}
%%
[0-9]+ { printf("NUMBER"); }
[A-z]+ { printf("%s", yytext); }
%%
```
The syntax `[0-9]` is the same as `(0 | 1 | ... | 9)`; it works for any range of characters and is available in all modern regexes. Note that inside every regex's "body" you have an assortment of variables already set, one of them being `yytext`, which is just the whole match. Additionally, lex matches the longest possible match, so if we had a rule `[0-z]`, only it would be matched.
If this file is located in `lexer.l`, we can transform it into a C file with:

```
flex lexer.l
```
By default, the resulting file is named `lex.yy.c`. Afterwards, we can compile the C code into a working program:

```
gcc lex.yy.c -lfl
```
Without the `-lfl` flag, we would get a linking error, because of the `main` and `yywrap` functions, which are required but aren't defined in our generated code.
There are three ways to fix this problem:

- The one I've opted for: use the `-lfl` flag to link our C code with a (flex) library which provides both functions. Usually the lexer isn't meant to be run by itself, so for the time being `-lfl` is the best option; the other two are fine when we get to yacc.
- Put `%option noyywrap` in the definitions section, which means the lexer will only read one file.
- Define it ourselves. It should return 0 if our lexer should continue with the next given file after finishing with the current one, or 1 if it should stop execution. The function could be written in either "C code" section, but I personally prefer the last one. Since we only have the one input, stopping is what we want:

  ```
  int yywrap() { return 1; }
  ```

Of course, you can use your lex implementation or C compiler of choice, and then this might not be a problem.
Now, if you run `./a.out`, you'll get a prompt; every line you write will be lexically analyzed, and in our case, printed back out. Entering

```
The number 10 changed to 8, 38, and 19 sequentially, before becoming 0.
```

will output

```
The number NUMBER changed to NUMBER, NUMBER, and NUMBER sequentially, before becoming NUMBER.
```
Now let's try something more sophisticated: let's surround numbers and strings in HTML tags, the stuff that powers webpages. We want numbers to be in the format

```
<span class="num">number</span>
```

and strings:

```
<span class="str">string</span>
```

We'll define numbers as both whole and floating point numbers (with sign), and strings can be surrounded by either single or double quotes.
```
%{
%}
/* Only multiline comments are supported, not comments that start with // */
/* This is a definition, it's in the format "NAME regex" */
DIGIT [0-9]
%%
 /* Comments in here have to be indented, otherwise you'll get an error */
 /* ? means 0 or 1 of the preceding group */
 /* + means 1 or more of the preceding group */
 /* the backslashes are used because . and + have special meaning in regexes */
[\+-]?{DIGIT}+(\.{DIGIT}+)? { printf("<span class=\"num\">%s</span>", yytext); }
 /* [^] is a group of characters, like [], however it specifies "all characters except the ones given" */
 /* this way, if we have "This is a string with " three quotes", it's not going to match the whole thing */
\"[^"]*\"|'[^']*' { printf("<span class=\"str\">%s</span>", yytext); }
%%
```
With the input

```
John said 10 times "2 + 2 = 5" but '2 + 2 = 4'!
```

we'll get

```
John said <span class="num">10</span> times <span class="str">"2 + 2 = 5"</span> but <span class="str">'2 + 2 = 4'</span>!
```
You've probably already noticed that the numbers inside the strings aren't matched, because the string rule results in a longer match. In certain programming languages that don't have string interpolation this is fine, but for others our regexes would have to be a lot more complicated.
yacc
Similarly to lex, yacc originates from 1975, and its modern replacement is GNU Bison.
A yacc file looks very similar to a lex one:

```
%{
// C code that will be put in the program verbatim
%}
/* Declarations */
%%
/* Rules */
%%
// C code
```
however, rather than definitions you have declarations, and the rules are in a completely different format (because they're grammars). Token names are defined in the declarations section, in the format

```
%token NAME1 NAME2 ...
```

You can also define precedence with `%left` and `%right`. There are other options you can set too, like the start name with `%start`.
Grammars are in the form

```
name : TOKEN1 TOKEN2 ...
     ;
```

where alternation is defined like

```
name : TOKEN1
     | TOKEN2
     | TOKEN3
     ;
```

and it always must end with a `;`. On each alternation, you can (should) specify a body, which will contain C code to be executed if this rule is reached:

```
name : TOKEN1 { C code here }
     | TOKEN2
     | TOKEN3 { C code here }
     ;
```
Another difference is that, since the parser needs a list of tokens, you cannot run it by itself; you'll also need to set up a lexer.
Let's implement a parser for the grammar8:

```
N -> 1 | 2 | 3
S -> N | (S + S) | (S * S)
```
First, our tokens: we want to replace all characters with a token. For 1, 2 and 3, let's use the token name "NUMBER"; for "(" and ")", LBRACE and RBRACE accordingly; and since there is no precedence in this grammar, we don't actually care if we have `(S + S)` or `(S * S)`, so we can assign both `+` and `*` to the token MATHOP. Meaning, our grammar would behave more like this:

```
N -> 1 | 2 | 3
M -> + | *
S -> N | (S M S)
```
First, we need to make our `lexer.l`; it is going to be very simple:

```
%{
#include "parser.tab.h"
%}
%%
 /* Remember, [] is a group, [abcd] behaves like "(a | b | c | d)", and "\" escapes characters */
[123]  { return NUMBER; }
[\+\*] { return MATHOP; }
\(     { return LBRACE; }
\)     { return RBRACE; }
%%
```
Now, tokens (names) are defined inside grammars; however, our lexer needs to emit (return) them. That's why, when running bison on our `parser.y`, we'll also create a C header file which contains those names already defined, so they're accessible inside `lexer.l`. This file is named `FILENAME.tab.h`, where `FILENAME.y` is our source yacc file.
Second, we need to make our `parser.y`:

```
%{
#include <stdio.h>
int yylex();
int yyerror(char *s);
%}
%token NUMBER MATHOP LBRACE RBRACE
%%
expr : NUMBER
     | LBRACE expr MATHOP expr RBRACE
     ;
%%
int yyerror(char *s) {
    printf("Error: %s\n", s);
    return 0;
}
int main() {
    yyparse();
}
```
A quirk with GNU Bison is that you need to declare the functions `yylex` and `yyerror` beforehand, manually define `yyerror`, and run `yyparse` in the main function. These are precisely the first five and the last several lines of the code above, but for now you don't need to think about them.
Let's get to compiling. First, the lexer:

```
flex lexer.l
```

then our parser:

```
bison -d parser.y
```

where the `-d` option means we'll also output the `parser.tab.h` header file. Finally, we compile both of them together into one executable:

```
gcc lex.yy.c parser.tab.c -lfl
```
Now, when you run `./a.out`: if you enter a valid string, you'll see a blank line, but if it isn't valid, you'll get "Error: syntax error".
Doing it properly
The way we created that lexer and parser worked, but it wasn't very useful, and it has multiple issues we have to address:
- When entering a string that isn't matched at all, it gets printed back like nothing happened. For example, if we enter the letter `a`, or any string which doesn't have the `(`, `)`, `1`, `2`, `3`, `+` or `*` characters, it will just get printed back. Ideally, we want to raise an error, so we're going to add one rule which catches any (non-special) character and throws an error with the built-in `yy_fatal_error`:

  ```
  . { yy_fatal_error("Bad character"); }
  ```

  This rule must be the last rule, since we don't want it to be matched if there is another rule that matches. Remember that lex matches the longest possible string, and if there are multiple rules which match the same number of characters, only the first one is matched.
- For every token name returned by the lexer, we also need to have its actual value. We could make a `struct` which holds both a token name and its value, but there is another "built-in" mechanism for that in yacc/Bison: we set the global variable `yylval` to the token value, and it will be changed for every token. By default its type is `int` (in Bison), but usually you would want it to be a union. Let's make it a union of int and char, though the first part is kind of useless, since we're only working with single digit numbers.

  ```
  %union {
      int num;
      char chr;
  }
  ```

  Now, inside our lexer, rules would look like:

  ```
  RULE1 { yylval.num = 129; return TOKEN; }
  RULE2 { yylval.chr = 'a'; return TOKEN; }
  ```

  Inside C blocks for grammar rules, we can reference the token values with `$i`, where `i` is the i'th "token" in the expression. For example, in this rule, `$1` would be `(`, `$3` would be the value of token `B`, and `$5` would be the value of the rule name:

  ```
  rule : ( A B ) rule
       ;
  ```

  You can set a rule's "return" value by setting the variable `$$`:

  ```
  rule : LBRACE rule RBRACE { $$ = 10; }
  ```

  The way you set its type is discussed later. However, `$i` isn't of type union; we cannot do something like `$i.num`, it is the precise value that was set for the token. Therefore, we also need to define which value from the union every token will set, which is done by adding `<PROPERTY>` before every defined token (where `PROPERTY` is a property name inside the union):

  ```
  %token <num> NUMBER <chr> MATHOP <chr> LBRACE <chr> RBRACE
  ```
- If we enter one valid math expression, we'll get no error, but when we enter another, we'll get a "syntax error"9. That is because we try to match only one expression; after it is matched, our parser is waiting for an end of file character. Let's make it so that we try to match an expression on every line:

  ```
  start : expr
        | expr start
        ;
  ```

  A rule like `expr : expr expr` would produce shift/reduce conflicts. Another option is duplicating our `expr` rules and adding `expr` at the beginning or end, but that is cumbersome if we want to make changes.

- We're getting input from the user, but our source text (code) will be in a file. First, our `main` function needs to take program parameters:

  ```
  int main(int argc, char* argv[])
  ```

  Then we need to take the first argument and open a file. Additionally, we have to check if the file was successfully opened, and print an error and stop execution if not:

  ```
  FILE* input = fopen(argv[1], "r");
  if (input == NULL) {
      yyerror("cannot read input file!");
      return 1;
  }
  ```

  Now for the clever bit: actual parsing in the generated code is done on the file `yyin`, which by default is the standard user input. To read a file, all we need to do is change the value of that variable. Also, that variable is defined externally, so first we need to add to the (top) C section:

  ```
  extern FILE* yyin;
  ```

  and then put in our `main`:

  ```
  yyin = input;
  ```

  On *NIX systems (Linux, macOS, OpenBSD, …) you can still use user input by passing the file `/dev/stdin`. If you want your front-end to support multiple files, you would just need to wrap the whole body of main in a `for` loop, iterating over the whole `argv` array (remember that `argc` is the length of the array).

- We need to construct an AST. Since the exact tree structure depends on your needs, I'll make the simplest possible tree in pure C that works for this grammar. I won't explain this in much detail, since it is just normal C code.
  ```
  #include <stdlib.h>
  #include <stdio.h>

  typedef struct node {
      char val;
      struct node *left;
      struct node *right;
  } node;

  node* newNode(char val, node *left, node *right) {
      node* out = malloc(sizeof(node));
      out->val = val;
      out->left = left;
      out->right = right;
      return out;
  }

  void printTree(node *n, int level) {
      if (n == NULL)
          return;
      // %*s will add level amounts of spaces (but with the "" string)
      printf("%*s|-%c\n", level, "", n->val);
      printTree(n->left, level + 1);
      printTree(n->right, level + 1);
  }

  void freeNode(node *n) {
      if (n == NULL)
          return;
      freeNode(n->left);
      freeNode(n->right);
      free(n);
  }
  ```
  However, all of this is quite a lot, and it would take a lot of space in `parser.y`, so let's separate it into `tree.c` and add a header file `tree.h` which contains the definitions:

  ```
  #include <stdlib.h>
  #include <stdio.h>

  typedef struct node {
      char val;
      struct node *left;
      struct node *right;
  } node;

  node* newNode(char val, node *left, node *right);
  void printTree(node *n, int level);
  void freeNode(node *n);
  ```
  Of course, we should add some ifndef-def preprocessor code, but that isn't relevant for now.
  The way we're going to use trees in our grammar is as follows: for the `NUMBER` expression of `expr`, we'll create a node with the (char) value of the number and "return" it. For the bracketed expression of `expr`, we'll "return" a node with the value of the math operator, whose left and right children will be the "returned" values. Inside start, after we've parsed an `expr`, we'll print out the "returned" tree and free its memory.

  I'm putting returned in quotes because, as mentioned, we'll just be setting the value of `$$`. `$$`'s type is set with the `%type` option, which works exactly like `%token`, but with grammar rule names. However, this means that `node` must be available in `%union`, which is unfortunate but isn't a problem; we'll just add it.

  Now, finally, it's time to include our `tree.h` in `parser.y`, but we'll also need to include it in `lexer.l`. The reason is that the `%union` appears in `parser.tab.h`, which means our node structure is referenced in `parser.tab.h`, but `tree.h` isn't included there.
All in all, we’ll have 4 source files for our simple front-end:
lexer.l

```
%{
#include "tree.h"
#include "parser.tab.h"
%}
%option noyywrap
%%
[123]  { yylval.num = atoi(yytext); return NUMBER; }
[\+\*] { yylval.chr = yytext[0]; return MATHOP; }
\(     { yylval.chr = '('; return LBRACE; }
\)     { yylval.chr = ')'; return RBRACE; }
\      { } /* Don't do anything on spaces */
.      { yy_fatal_error("Bad character"); }
%%
```
parser.y

```
%{
#include <stdio.h>
#include "tree.h"

extern FILE* yyin;
int yylex();
int yyerror(const char *s);
%}
%union {
    int num;
    char chr;
    node *n;
}
%token <num> NUMBER <chr> MATHOP <chr> LBRACE <chr> RBRACE
%type <n> expr
%%
start : expr       { printTree($1, 0); freeNode($1); }
      | expr start { printTree($1, 0); freeNode($1); }
      ;
expr : NUMBER                         { $$ = newNode($1 + '0', NULL, NULL); }
     | LBRACE expr MATHOP expr RBRACE { $$ = newNode($3, $2, $4); }
     ;
%%
int yyerror(const char *s) {
    printf("Error: %s\n", s);
    return 0;
}
int main(int argc, char* argv[]) {
    FILE* input = fopen(argv[1], "r");
    if (input == NULL) {
        yyerror("reading input file!");
        return 1;
    }
    yyin = input;
    yyparse();
}
```
tree.h

```
#ifndef TREE
#define TREE
#include <stdlib.h>
#include <stdio.h>

typedef struct node {
    char val;
    struct node *left;
    struct node *right;
} node;

node* newNode(char val, node *left, node *right);
void printTree(node *n, int level);
void freeNode(node *n);
#endif
```
tree.c

```
#include "tree.h"

node* newNode(char val, node *left, node *right) {
    node* out = malloc(sizeof(node));
    out->val = val;
    out->left = left;
    out->right = right;
    return out;
}

void printTree(node *n, int level) {
    if (n == NULL)
        return;
    printf("%*s|-%c\n", level, "", n->val);
    printTree(n->left, level + 1);
    printTree(n->right, level + 1);
}

void freeNode(node *n) {
    if (n == NULL)
        return;
    freeNode(n->left);
    freeNode(n->right);
    free(n);
}
```
We compile everything:

```
flex lexer.l && bison -d parser.y && gcc lex.yy.c parser.tab.c tree.c
```
Now, let's create the file `test.txt`:

```
((1+1) * ((3+3) * 2))
```

and run our front-end on it:

```
./a.out test.txt
```
Our output will be:

```
|-*
 |-+
  |-1
  |-1
 |-*
  |-+
   |-3
   |-3
  |-2
```
Perfect, we have a good looking tree!
Appendix
Originally I planned to put everything in one blog post, but writing the front-end (this part) alone took me two and a half weeks of (inconsistent, since I was suddenly hit with multiple exams and homework in uni) writing. Additionally, while doing research, I found out the course actually skipped over a lot of important topics, which I'll first need to research. Future parts will be linked right below the title at the very top, but it will probably be quite a while until I write them.
Many thanks to Dimitar Tagarev for his help with reviewing this post.
1. Though it's obvious, I would like to point out that modern compilers are most often written in languages that are themselves compiled. You might ask yourself "where did the first compiler come from?" The answer is simple: it was written with "numbers", with direct instructions to the processor, from the time when computers were simple enough that all programming was done with CPU instructions.
2. Formally, we call the rule "names" nonterminals, the idea being that all normal characters that could appear in our string are called terminals, since they would terminate replacement and string generation. For the purposes of this post, I won't use the formal names, with the hope that using less terminology will increase understandability.
3. This is pretty much the mathematical notation; often in CS the Backus-Naur form is used, but I find that a bit harder to read. You might also make the argument that I should use a proper unicode arrow symbol, like `→`, but `->` is way more convenient.
4. Such grammars are called regular grammars; more formally, every rule has to contain at most one name, and that name must be at the start or end of the rule's characters.
5. Be careful about cases that are too complicated: the grammar I'm showing is "context-free", but that topic is way beyond the scope of this post.
6. Yeah yeah, in a statically typed language you might not be able to do operations on doubles and integers at the same time, but you get the point.
7. The graphics are made with nomnoml by Daniel Kallin.
8. You might be tempted to try and make one for the grammar inside the AST section, but that one is ever so slightly problematic, because it's ambiguous (I showed why at the end of Implementational difficulties with grammars). Bison can handle ambiguity by always choosing to shift, but I didn't want to complicate things.
9. You can add `%define parse.error verbose` inside the definitions of your yacc file, and it will tell you what character was expected and what character was read.