ASTs
There's something I find a little frustrating about our current compiler setup.
The s_exp
data type is very general: maybe a bit too general.
Programs like (+ 3 4)
are converted into s-expressions in an obvious way.
But unfortunately, "programs" like (hello cs1260)
can also be converted into s-expressions.
Lst [Sym "hello"; Sym "cs1260"]
is a term of type s_exp
in our OCaml library,
but it's not something we want to compile.
Basically, the s_exp
datatype doesn't accurately represent the valid programs of our language.
So far we've called a program valid if the interpreter is willing to evaluate it.
Let's refine that a bit.
Syntactic validity
If a program is just a string of symbols, there are lots that we want to rule out as valid.
Strings with mismatched parentheses, nonsense keywords, etc: all out.
Our s_exp
type didn't represent programs like this.
But more subtlely, our language heavily restricts what symbols appear in "function application"
position, at the front of an s-expression.
This is why (+ 2 3)
is okay but (hello cs1260)
is not.
If an s-expression only applies symbols that are part of our language, and applies them to the expected number of arguments, we will say that it is syntactically valid. This should be easy for us to check, and if we do so, we can throw errors on these programs before they even reach the compiler or interpreter. Effectively, these will be "parse errors." By construction, our compiler and interpreter will produce the same output on syntactically invalid programs: they'll both error out at the parsing step. This is good, and removes a point at which we could see divergent behavior!
ASTs
Technically we're not really talking about parsing yet. The way we'll implement this idea is by adding a new transformation layer to our pipeline. We'll take programs as strings, convert them to s-expressions, then convert these s-expressions to abstract syntax trees, which by design will only represent syntactically valid programs. We'll update our interpreter and compiler to work on ASTs instead of directly on s-expressions.
First: what are all of the valid unary operations in our language? Let's make a list.
open S_exp exception BadSExpression of s_exp type prim1 = Add1 | Sub1 | ZeroP | NumP | Not let prim1_of_string (s : string) : prim1 option = match s with | "add1" -> Some Add1 | "sub1" -> Some Sub1 | "zero?" -> Some ZeroP | "num?" -> Some NumP | "not" -> Some Not | _ -> None
Similarly, for binary operations:
type prim2 = Plus | Minus | Eq | Lt let prim2_of_string = function | "+" -> Some Plus | "-" -> Some Minus | "=" -> Some Eq | "<" -> Some Lt | _ -> None
We can now concisely write down a type that captures all and only syntactically valid expressions.
type expr = | True | False | Num of int | If of expr * expr * expr | Prim1 of prim1 * expr | Prim2 of prim2 * expr * expr
Yes, we could have implemented If
as a "Prim3
", but we won't have many of those!
Note that there is no expr
corresponding to (hello cs1260)
.
let rec expr_of_s_exp (e : s_exp) : expr = match e with | Num n -> Num n | Sym "true" -> True | Sym "false" -> False | Lst [Sym "if"; e1; e2; e3] -> If (expr_of_s_exp e1, expr_of_s_exp e2, expr_of_s_exp e3) | Lst [Sym op; e1] when Option.is_some (prim1_of_string op) -> Prim1 (Option.get (prim1_of_string op), expr_of_s_exp e1) | Lst [Sym op; e1; e2] when Option.is_some (prim2_of_string op) -> Prim2 ( Option.get (prim2_of_string op) , expr_of_s_exp e1 , expr_of_s_exp e2 ) | _ -> raise (BadSExpression e)
ASTs in the compiler
It only takes very minor updates to rebase our compiler to use this new AST type.
let rec compile_exp (stack_index : int) (exp : expr) : directive list = match exp with | True -> [Mov (Reg Rax, operand_of_bool true)] | False -> [Mov (Reg Rax, operand_of_bool false)] | Num n -> [Mov (Reg Rax, operand_of_num n)] | Prim1 (Add1, arg) -> compile_exp stack_index arg @ [Add (Reg Rax, Imm (1 lsl num_shift))] | Prim1 (Sub1, arg) -> compile_exp stack_index arg @ [Sub (Reg Rax, Imm (1 lsl num_shift))] | Prim1 (Not, arg) -> compile_exp stack_index arg @ [Cmp (Reg Rax, Imm ((0 lsl bool_shift) lor bool_tag))] @ zf_to_bool | Prim1 (ZeroP, arg) -> compile_exp stack_index arg @ [Cmp (Reg Rax, operand_of_num 0)] @ zf_to_bool | Prim1 (NumP, arg) -> compile_exp stack_index arg @ [And (Reg Rax, Imm num_mask); Cmp (Reg Rax, Imm num_tag)] @ zf_to_bool | Prim2 (Plus, e1, e2) -> compile_exp stack_index e1 @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)] @ compile_exp (stack_index - 8) e2 @ [Add (Reg Rax, MemOffset (Reg Rsp, Imm stack_index))] | Prim2 (Minus, e1, e2) -> compile_exp stack_index e1 @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)] @ compile_exp (stack_index - 8) e2 @ [ Mov (Reg R8, Reg Rax) ; Mov (Reg Rax, MemOffset (Reg Rsp, Imm stack_index)) ] @ [Sub (Reg Rax, Reg R8)] | Prim2 (Eq, e1, e2) -> compile_exp stack_index e1 @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)] @ compile_exp (stack_index - 8) e2 @ [ Mov (Reg R8, MemOffset (Reg Rsp, Imm stack_index)) ; Cmp (Reg Rax, Reg R8) ] @ zf_to_bool | Prim2 (Lt, e1, e2) -> compile_exp stack_index e1 @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)] @ compile_exp (stack_index - 8) e2 @ [ Mov (Reg R8, MemOffset (Reg Rsp, Imm stack_index)) ; Cmp (Reg Rax, Reg R8) ] @ lf_to_bool | If (test_exp, then_exp, else_exp) -> let else_label = gensym "else" in let continue_label = gensym "continue" in compile_exp stack_index test_exp @ [Cmp (Reg Rax, operand_of_bool false); Jz else_label] @ compile_exp stack_index then_exp @ [Jmp continue_label] @ [Label else_label] @ compile_exp stack_index else_exp @ [Label continue_label] let compile_to_file (program : string) : unit = let file = open_out "program.s" in parse program |> expr_of_s_exp |> compile |> output_string file ; close_out file
The key thing to note here: our compiler will never throw an error!
This is a pure function (except for gensym
, shh).
Programs like (hello cs1260)
will be caught at the s_exp -> expr
conversion stage.
ASTs in the interpreter
Updating our interpreter is similarly straightforward.
let rec interp_exp (exp : expr) : value = match exp with | Num n -> Number n | True -> Boolean true | False -> Boolean false | Prim1 (Not, arg) -> if interp_exp arg = Boolean false then Boolean true else Boolean false | Prim1 (ZeroP, arg) -> ( match interp_exp arg with | Number 0 -> Boolean true | _ -> Boolean false ) | Prim1 (NumP, arg) -> ( match interp_exp arg with | Number _ -> Boolean true | _ -> Boolean false ) | Prim1 (Add1, arg) -> ( match interp_exp arg with | Number n -> Number (n + 1) | _ -> raise (BadExpression exp) ) | Prim1 (Sub1, arg) -> ( match interp_exp arg with | Number n -> Number (n - 1) | _ -> raise (BadExpression exp) ) | Prim2 (Plus, e1, e2) -> ( match (interp_exp e1, interp_exp e2) with | Number n1, Number n2 -> Number (n1 + n2) | _ -> raise (BadExpression exp) ) | Prim2 (Minus, e1, e2) -> ( match (interp_exp e1, interp_exp e2) with | Number n1, Number n2 -> Number (n1 - n2) | _ -> raise (BadExpression exp) ) | Prim2 (Eq, e1, e2) -> Boolean (interp_exp e1 = interp_exp e2) | Prim2 (Lt, e1, e2) -> ( match (interp_exp e1, interp_exp e2) with | Number n1, Number n2 -> Boolean (n1 < n2) | _ -> raise (BadExpression exp) ) | If (test_exp, then_exp, else_exp) -> if interp_exp test_exp = Boolean false then interp_exp else_exp else interp_exp then_exp let interp (program : string) : string = parse program |> expr_of_s_exp |> interp_exp |> string_of_value
But: our interpreter still can throw error messages. Why the divergence?
Semantic validity
The programs we ruled out with this AST type were ones that didn't make sense
at parse time. But there are programs that, while syntactically valid, don't
sensibly correspond to values. We've seen plenty of examples here; think of
our old friend (+ 1 false)
.
Hey, OCaml would never let us write 1 + false
.
Its ASTs seem to hold even more information than ours:
expressions get their types tracked alongside their values.
We could try to do this too, and catch these issues at parse time as well.
But other languages like Python are fine with strange things like this,
and there are perfectly "reasonable" programs we could write down
where the type isn't known at compile time.
(Can you think of some?)
The points at which the interpreter throws errors correspond to expressions
that are syntactically valid, but not meaningful based on our decisions about
the symbols of our language.
We've made choices, for example, not to treat true
and false
as 1
and 0
for the sake of addition.
The interpreter codifies our semantic choices, and points out when an expression
is semantically invalid.
For semantically valid programs, we want to insist that our compiler and interpreter match. Syntactically invalid programs won't even parse. On programs in the middle, where the interpreter runs but errors, our compiler can do anything -- we've failed to specify the behavior we expect, so all bets are off.