ASTs

There's something I find a little frustrating about our current compiler setup. The s_exp data type is very general: maybe a bit too general. Programs like (+ 3 4) are converted into s-expressions in an obvious way. But unfortunately, "programs" like (hello cs1260) can also be converted into s-expressions. Lst [Sym "hello"; Sym "cs1260"] is a term of type s_exp in our OCaml library, but it's not something we want to compile.

Basically, the s_exp datatype doesn't accurately represent the valid programs of our language. So far we've called a program valid if the interpreter is willing to evaluate it. Let's refine that a bit.

Syntactic validity

If a program is just a string of symbols, there are lots that we want to rule out as valid. Strings with mismatched parentheses, nonsense keywords, etc: all out. Our s_exp type didn't represent programs like this. But more subtlely, our language heavily restricts what symbols appear in "function application" position, at the front of an s-expression. This is why (+ 2 3) is okay but (hello cs1260) is not.

If an s-expression only applies symbols that are part of our language, and applies them to the expected number of arguments, we will say that it is syntactically valid. This should be easy for us to check, and if we do so, we can throw errors on these programs before they even reach the compiler or interpreter. Effectively, these will be "parse errors." By construction, our compiler and interpreter will produce the same output on syntactically invalid programs: they'll both error out at the parsing step. This is good, and removes a point at which we could see divergent behavior!

ASTs

Technically we're not really talking about parsing yet. The way we'll implement this idea is by adding a new transformation layer to our pipeline. We'll take programs as strings, convert them to s-expressions, then convert these s-expressions to abstract syntax trees, which by design will only represent syntactically valid programs. We'll update our interpreter and compiler to work on ASTs instead of directly on s-expressions.

First: what are all of the valid unary operations in our language? Let's make a list.

open S_exp

exception BadSExpression of s_exp

type prim1 = Add1 | Sub1 | ZeroP | NumP | Not

let prim1_of_string (s : string) : prim1 option =
  match s with
  | "add1" ->
      Some Add1
  | "sub1" ->
      Some Sub1
  | "zero?" ->
      Some ZeroP
  | "num?" ->
      Some NumP
  | "not" ->
      Some Not
  | _ ->
      None

Similarly, for binary operations:

type prim2 = Plus | Minus | Eq | Lt

let prim2_of_string = function
  | "+" ->
      Some Plus
  | "-" ->
      Some Minus
  | "=" ->
      Some Eq
  | "<" ->
      Some Lt
  | _ ->
      None

We can now concisely write down a type that captures all and only syntactically valid expressions.

type expr =
  | True
  | False
  | Num of int
  | If of expr * expr * expr
  | Prim1 of prim1 * expr
  | Prim2 of prim2 * expr * expr

Yes, we could have implemented If as a "Prim3", but we won't have many of those! Note that there is no expr corresponding to (hello cs1260).

let rec expr_of_s_exp (e : s_exp) : expr =
  match e with
  | Num n ->
      Num n
  | Sym "true" ->
      True
  | Sym "false" ->
      False
  | Lst [Sym "if"; e1; e2; e3] ->
      If (expr_of_s_exp e1, expr_of_s_exp e2, expr_of_s_exp e3)
  | Lst [Sym op; e1] when Option.is_some (prim1_of_string op) ->
      Prim1 (Option.get (prim1_of_string op), expr_of_s_exp e1)
  | Lst [Sym op; e1; e2] when Option.is_some (prim2_of_string op) ->
      Prim2
        ( Option.get (prim2_of_string op)
        , expr_of_s_exp e1
        , expr_of_s_exp e2 )
  | _ ->
      raise (BadSExpression e)

ASTs in the compiler

It only takes very minor updates to rebase our compiler to use this new AST type.

let rec compile_exp (stack_index : int) (exp : expr) : directive list
    =
  match exp with
  | True ->
      [Mov (Reg Rax, operand_of_bool true)]
  | False ->
      [Mov (Reg Rax, operand_of_bool false)]
  | Num n ->
      [Mov (Reg Rax, operand_of_num n)]
  | Prim1 (Add1, arg) ->
      compile_exp stack_index arg
      @ [Add (Reg Rax, Imm (1 lsl num_shift))]
  | Prim1 (Sub1, arg) ->
      compile_exp stack_index arg
      @ [Sub (Reg Rax, Imm (1 lsl num_shift))]
  | Prim1 (Not, arg) ->
      compile_exp stack_index arg
      @ [Cmp (Reg Rax, Imm ((0 lsl bool_shift) lor bool_tag))]
      @ zf_to_bool
  | Prim1 (ZeroP, arg) ->
      compile_exp stack_index arg
      @ [Cmp (Reg Rax, operand_of_num 0)]
      @ zf_to_bool
  | Prim1 (NumP, arg) ->
      compile_exp stack_index arg
      @ [And (Reg Rax, Imm num_mask); Cmp (Reg Rax, Imm num_tag)]
      @ zf_to_bool
  | Prim2 (Plus, e1, e2) ->
      compile_exp stack_index e1
      @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)]
      @ compile_exp (stack_index - 8) e2
      @ [Add (Reg Rax, MemOffset (Reg Rsp, Imm stack_index))]
  | Prim2 (Minus, e1, e2) ->
      compile_exp stack_index e1
      @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)]
      @ compile_exp (stack_index - 8) e2
      @ [ Mov (Reg R8, Reg Rax)
        ; Mov (Reg Rax, MemOffset (Reg Rsp, Imm stack_index)) ]
      @ [Sub (Reg Rax, Reg R8)]
  | Prim2 (Eq, e1, e2) ->
      compile_exp stack_index e1
      @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)]
      @ compile_exp (stack_index - 8) e2
      @ [ Mov (Reg R8, MemOffset (Reg Rsp, Imm stack_index))
        ; Cmp (Reg Rax, Reg R8) ]
      @ zf_to_bool
  | Prim2 (Lt, e1, e2) ->
      compile_exp stack_index e1
      @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)]
      @ compile_exp (stack_index - 8) e2
      @ [ Mov (Reg R8, MemOffset (Reg Rsp, Imm stack_index))
        ; Cmp (Reg Rax, Reg R8) ]
      @ lf_to_bool
  | If (test_exp, then_exp, else_exp) ->
      let else_label = gensym "else" in
      let continue_label = gensym "continue" in
      compile_exp stack_index test_exp
      @ [Cmp (Reg Rax, operand_of_bool false); Jz else_label]
      @ compile_exp stack_index then_exp
      @ [Jmp continue_label] @ [Label else_label]
      @ compile_exp stack_index else_exp
      @ [Label continue_label]

let compile_to_file (program : string) : unit =
  let file = open_out "program.s" in
  parse program |> expr_of_s_exp |> compile |> output_string file ;
  close_out file

The key thing to note here: our compiler will never throw an error! This is a pure function (except for gensym, shh). Programs like (hello cs1260) will be caught at the s_exp -> expr conversion stage.

ASTs in the interpreter

Updating our interpreter is similarly straightforward.

let rec interp_exp (exp : expr) : value =
  match exp with
  | Num n ->
      Number n
  | True ->
      Boolean true
  | False ->
      Boolean false
  | Prim1 (Not, arg) ->
      if interp_exp arg = Boolean false then Boolean true
      else Boolean false
  | Prim1 (ZeroP, arg) -> (
    match interp_exp arg with
    | Number 0 ->
        Boolean true
    | _ ->
        Boolean false )
  | Prim1 (NumP, arg) -> (
    match interp_exp arg with
    | Number _ ->
        Boolean true
    | _ ->
        Boolean false )
  | Prim1 (Add1, arg) -> (
    match interp_exp arg with
    | Number n ->
        Number (n + 1)
    | _ ->
        raise (BadExpression exp) )
  | Prim1 (Sub1, arg) -> (
    match interp_exp arg with
    | Number n ->
        Number (n - 1)
    | _ ->
        raise (BadExpression exp) )
  | Prim2 (Plus, e1, e2) -> (
    match (interp_exp e1, interp_exp e2) with
    | Number n1, Number n2 ->
        Number (n1 + n2)
    | _ ->
        raise (BadExpression exp) )
  | Prim2 (Minus, e1, e2) -> (
    match (interp_exp e1, interp_exp e2) with
    | Number n1, Number n2 ->
        Number (n1 - n2)
    | _ ->
        raise (BadExpression exp) )
  | Prim2 (Eq, e1, e2) ->
      Boolean (interp_exp e1 = interp_exp e2)
  | Prim2 (Lt, e1, e2) -> (
    match (interp_exp e1, interp_exp e2) with
    | Number n1, Number n2 ->
        Boolean (n1 < n2)
    | _ ->
        raise (BadExpression exp) )
  | If (test_exp, then_exp, else_exp) ->
      if interp_exp test_exp = Boolean false then interp_exp else_exp
      else interp_exp then_exp

let interp (program : string) : string =
  parse program |> expr_of_s_exp |> interp_exp |> string_of_value

But: our interpreter still can throw error messages. Why the divergence?

Semantic validity

The programs we ruled out with this AST type were ones that didn't make sense at parse time. But there are programs that, while syntactically valid, don't sensibly correspond to values. We've seen plenty of examples here; think of our old friend (+ 1 false).

Hey, OCaml would never let us write 1 + false. Its ASTs seem to hold even more information than ours: expressions get their types tracked alongside their values. We could try to do this too, and catch these issues at parse time as well. But other languages like Python are fine with strange things like this, and there are perfectly "reasonable" programs we could write down where the type isn't known at compile time. (Can you think of some?)

The points at which the interpreter throws errors correspond to expressions that are syntactically valid, but not meaningful based on our decisions about the symbols of our language. We've made choices, for example, not to treat true and false as 1 and 0 for the sake of addition. The interpreter codifies our semantic choices, and points out when an expression is semantically invalid.

For semantically valid programs, we want to insist that our compiler and interpreter match. Syntactically invalid programs won't even parse. On programs in the middle, where the interpreter runs but errors, our compiler can do anything -- we've failed to specify the behavior we expect, so all bets are off.