Today we have a bit of a change of pace:
let's come back to the very beginning of our compiler pipeline,
the stage between the string input we receive and the AST we send to the compiler and interpreter.
We talked early on in the semester about what an AST was and why it was useful to us.
There's not much interesting going on in the transformation from the s-expression
data type to the
expr AST data type.
But we left the string-to-s-expression transformation as a black box.
Traditionally, this has been a huge part of compiler design, one of the more theory-heavy parts of the study. We're not going to do it justice with one lecture. We couldn't even do it justice with two weeks. This is just a taste.
Tokenization vs parsing
Our goal is go go from user input (represented as a bare string)
to some tree-structured data that our compiler can work with, e.g. our OCaml
There are actually two steps in the pipeline here: tokenization and parsing.
Tokenization splits the input string into a list of tokens,
but being a list, this structure remains flat.
The parsing step turns this flat representation into a tree,
where the nested nature of the program is clear.
What is a token? It could be a parenthesis, or a number, or a symbol.
"(+ 1 (- 3 2))" ----> (tokenizing) [LPAREN; SYM "+"; LPAREN; SYM "-"; NUM 3; NUM 2; RPAREN; RPAREN] ----> (parsing) Lst [Sym "+"; Num 1; Lst [Sym "-"; Num 3; Num 2]]
Grammar of s-expressions
S_exp type describes much of how our programs look, but it's not complete.
In particular it doesn't have anything to say about parentheses.
So let's be a little more precise:
we'll write down what's known as a context-free grammar for our programs.
(This way of describing grammar is known as Backus-Naur form,
and you've seen it at the bottom of each homework handout.)
<s_exp> ::= | NUM (n) | SYM (s) | LPAREN <lst> RPAREN <lst> ::= | ε | <s_exp> <lst>
The symbols that appear in angle brackets, like
<s_exp>, are nonterminals.
These will not appear in the program string directly,
but are useful for expressing our grammar.
The other symbols like
SYM are terminals.
Each rule (beginning with
|) is called a production rule.
Much like our OCaml data types, we can read these as a list of alternatives:
s_exp is either a number, or a symbol, or a list enclosed in parentheses.
But it is also helpful to think of them as enumerating the ways to create an object,
hence the name "production rule."
For example, suppose we want to create an
We can pick any rule, say, the third, which says to write an
LPAREN, then create a
lst, then write an
And we see how to create a
lst in the following lines.
Here's an example of a larger s-expression created with production rules:
The leaves are all terminals (with associated data), and the internal nodes are all nonterminals. What s-expression does this correspond to?
The "reverse" way to think about this is as a parse tree. Starting with a given string, what does it take to convince ourselves it corresponds to an s-expression? We'd have to show that there's a way of applying a sequence of production rules that creates that string.
We're going to start with a very bad tokenizer. Let's not think too hard about it; we'll improve it later.
type s_exp = Num of int | Sym of string | Lst of s_exp list type token = NUM of int | SYM of string | LPAREN | RPAREN exception ParseError let token_of_string (s : string) = match s with | "(" -> LPAREN | ")" -> RPAREN | _ -> ( try NUM (int_of_string s) with _ -> SYM s ) let tokenize (s : string) = s |> String.split_on_char ' ' |> List.map token_of_string
We'll use a bunch of helper functions to write our parser.
These will be chosen in a principled way:
each nonterminal in our grammar will have its own parse function, so,
Each will consume a list of tokens,
and return a list of tokens remaining to be processed.
(A parser doesn't need to consume all of its input!)
In the case of
parse_s_exp, we'll also return an
parse_list, we'll also return an
let rec parse_s_exp (toks : token list) : s_exp * token list = match toks with | NUM n :: toks -> (Num n, toks) | SYM s :: toks -> (Sym s, toks) | LPAREN :: toks -> let exps, toks = parse_lst toks in (Lst exps, toks) | _ -> raise ParseError and parse_lst (toks : token list) : s_exp list * token list = match toks with | RPAREN :: toks -> (, toks) | _ -> let exp, toks = parse_s_exp toks in let exps, toks = parse_lst toks in (exp :: exps, toks)
Our grammar has a really nice property: the first token we look at determines which production rule, if any, will match! This means that our helper functions really look very similar to the BNF grammar we wrote above.
One interesting thing to notice is the first case in parse_lst. We wanted to look for the empty string, but how would we look for the empty string, since it’s just ""? Our tokenizer isn’t going to produce that for us. Instead we had to think about what would actually appear next if we had an empty list—which is to say, a right paren! We were able to figure that out by looking at where
<lst> is used on the right side of any prouction rules. In this case,
<lst> is used exactly once, in the last production rule that has
<s_exp> on the left side:
<s_exp> ::= LPAREN <lst> RPAREN. In future class sessions, we’ll see how we can use this same approach with other grammars.
For our second
<lst> case, things are a bit simpler. We know from the grammar that we expect to see an
<s_exp> and then another
<lst>. We already have helper functions for parsing those, so we’ll go ahead and call them.
let parse (s : string) = let toks = tokenize s in let exp, l = parse_s_exp toks in if List.length l = 0 then exp else raise ParseError
The tokenizer we used above was terrible. Traditionally, we don't bulid tokenizers like this. Instead a common approach is to use regular expressions.
Think of a regex as a "string tester." Given a regex and a string, they will either match or not. The grammar of regular expressions is simple and can be written in BNF: (note, this is shorthand and not fully precise)
<regex> ::= | <string> | <regex><regex> | (<regex>|<regex>) | <regex>*
The first production rule just takes a string literal and matches it literally.
For example, the regex
abc would match only the string
The second rule concatenates two regular expressions.
The third rule, as
| often does, stands for disjunction.
abc|def matches both the strings
The fourth ("Kleene star") matches zero or more occurences of the given regex.
a* would match
For convenience, we'll introduce some shorthand:
X+ := XX* X? := (X|ε) (X | Y | Z | ...) := (X | (Y | (Z | ...)))
How could we write a regex that would match numbers?
Maybe something like this:
And we can continue with this idea. To build a tokenizer for our language, we'll write a regexp (a "recognizer") that matches each token.
NUM := -?(0|1|2|3|4|5|6|7|8|9)+ SYM := (a|b|...|A|B|...|+|-|...)+ LPAREN := ( RPAREN := )
This approach can parse a string of length
But: then we need to parse it into a syntax tree.
The general problem here --
deciding whether a string of length
n is in the language described by an arbitrary context-free grammar --
is not solvable in
There are subsets of context-free grammars that can be parsed in linear time. Broadly speaking, there are two types of these, "top-down" parsers (often written by hand, e.g. LL(k)) and "bottom-up" parsers (often automatically generated, e.g. LR(k), LALR(k)).
Top-down vs bottom-up
We're going to write our parser by hand, so let's focus on the top-down variant. As an upside, it will be flexible and intuitive, and allow for good error messages; as a downside, we may have to adjust our grammar. These parsers generally share some features, some of which we've seen already: we'll use one function per non-terminal in our grammar, and these functions will call each other and themselves ("recursive descent").
The parser we're going to write is an example of an LL(1) parser. The 1 stands for the number of tokens we'll look ahead. Given a list of tokens, we'll want to figure out what production rule to apply; we'll make that decision based only on the first token in the list.
For a grammar that does not have this property, consider a grammar for arithmetic expressions:
<expr> ::= | NUM | <expr> + <expr> | <expr> * <expr> | (<expr>)
This infix notation is a lot harder to parse than our s-expression language. It's ambiguous! These expressions (and many others) correspond to multiple parse trees:
4 + 5 + 6 1 + 2 * 3
This is inherent to this grammar -- it takes human knowledge to know how we want to parse, say, the second expression here. We could manually resolve it by changing our grammar to something like this:
<expr> ::= | <term> + <expr> | <term> <term> ::= | <factor> * <term> | <factor> <factor> ::= | NUM | (<expr>)
But even though this grammar is now unambigious, it still isn't LL(1). Multiple production rules can start with the same terminal. The predictive parser we wrote before won't handle it.
But! There's yet another trick up our sleeve. The following grammar is equivalent and LL(1):
<expr> ::= | <term> <expr'> <expr'> ::= | + <expr> | ε <term> ::= | <factor> <term'> <term'> ::= | * <term> | ε <factor> ::= | NUM | (<expr>)
Let's implement this in OCaml.
We actually don't need a helper function for each nonterminal.
It's a little cleaner in this case to combine the helper functions for
type token = PLUS | TIMES | LPAREN | RPAREN | NUM of int type expr = Num of int | Plus of expr * expr | Times of expr * expr exception ParseError let rec parse_expr toks = let t, toks = parse_term toks in match toks with | PLUS :: toks -> let e, toks = parse_expr toks in (Plus (t, e), toks) | _ -> (t, toks) and parse_term toks = let f, toks = parse_factor toks in match toks with | TIMES :: toks -> let e, toks = parse_term toks in (Times (f, e), toks) | _ -> (f, toks) and parse_factor toks = match toks with | NUM n :: toks -> (Num n, toks) | LPAREN :: toks -> ( let e, toks = parse_expr toks in match toks with RPAREN :: toks -> (e, toks) | _ -> raise ParseError ) | _ -> raise ParseError