OCaml and S-Expressions
Last time we wrote a compiler for a very small language: the integers. Next time, we'll grow our language a bit. Today, though, we're going to step back and learn some OCaml. We'll also talk about how to represent our language's syntax.
OCaml basics
Here's the compiler we wrote last time:
compile.ml
let compile (program: string): string = String.concat "\n" [ "global entry"; "entry:"; Printf.sprintf "\tmov rax, %s" program; "\tret" ]
This program illustrates several OCaml features (see the lecture capture for details):
- Functions
- Types
- Lists
There are some list functions that will be helpful for us to know about.
Keep these in mind, and see what their output is by running the following in utop
:
let f x = x * 2;; List.map f [1; 2; 3];; List.map (fun x -> x * 2) [1; 2; 3];; List.filter (fun x -> x > 2) [1; 2; 3];; List.fold_left (fun x y -> x + y) 0 [1; 2; 3];; List.fold_left ( + ) 0 [1; 2; 3];;
S-Expressions
We want to eventually extend our language with support for many new features, such as conditionals and functions. We will talk in great depth about the semantics of those features: what a conditional expression means. We'll also need to figure out what the syntax corresponding to those features is. For instance, conditionals look like this in Javascript:
if (x) { return y; } else { return z; }
And like this in Python:
if x: return y else: return z
We will talk more about syntax much later in the course. For now, our language is going to support a very simple syntax: S-expressions.
S-expressions (short for "symbolic expressions") were introduced by John McCarthy in his paper on the LISP programming language. They were originally intended as an internal representation that programmers wouldn't manipulate directly, but LISP programmers liked writing them and the intended language syntax (M-Expressions) was never implemented. Here are some S-expressions:
(if x y z) (* (+ 2 6 4) 3) (define (f x) (if (= x 0) 1 (* x (f (- x 1)))))
S-expressions are a simple, convenient way to represent structured expressions. Because of their simple structure, they are very easy to represent and manipulate inside a compiler or an interpreter.
Look at the expressions above. What components are there? Ignoring everything we might know about what the expressions mean and focusing on how they look, we can see three different kinds of expression:
- Symbols
such asf
andif
- Numbers
such as4
and0
- Lists
sequences of other S-Expressions wrapper in parentheses, like(+ 2 6 4)
or(* (+ 2 6 4) 3)
S-Expressions in OCaml
How should we represent S-expressions in OCaml? One option would be to use strings. So we could have
let e1 = "(if x y z)" let e2 = "(* (+ 2 6 4) 3)"
This representation is convenient to read in: programmers write programs
as sequences of characters in files. It is, however, very difficult to
work with. Imagine extending our compiler to support unary operations on
numbers (for instance, a function that increments a number). Doing so
using this string representation would be very difficult. We'd have
to see if the first character of the string is a (
, then look at the
next characters to find the operation, then find the argument, and so
on. Moreover, we'd need to repeat this effort for every operation we
want to support.
A better representation for S-expressions will use the structure we talked about above. We can add a new type to OCaml to encode this structure, as follows:
type s_exp = Sym of string | Num of int | Lst of s_exp list
Sym
, Num
, and Lst
are all constructors: ways of building
s-expressions. So what this type definition is saying is that an
s-expression can be a symbol or a number or a list.
The "of" after each constructor indicates that that constructor is
associated with some data of a particular type. For instance, the symbol
if
represented as an s_exp
looks like Sym "if"
; 4000000000000
looks like Num 4000000000000
.
Lst
is defined recursively: it takes as its argument a list of other
s-expressions. These could be build using the Sym
or Num
constructors or using more Lst
constructors. So for instance, `(if x
y z)` looks like
let e1 = Lst [Sym "if"; Sym "x"; Sym "y"; Sym "z"]
and (* (+ 2 6 4) 3)
looks like
let e2 = Lst [Sym "*"; Lst [Sym "+"; Num 2; Num 6; Num 4]; Num 3]
We can see that compared to the string representation, which is "flat", this representation includes the nested structure of s-expressions. We can access this structure using pattern matching:
let which_constructor (e: s_exp) : string = match e with | Sym x -> "a symbol" | Num n -> "a number" | Lst l -> "a list"
This looks like a repeated "if". The real power of pattern matching is destructuring: we can use names defined on the left-hand side of each pattern.
let rec total (e: s_exp) : int = match e with | Sym _ -> 0 | Num n -> n | Lst l -> List.fold_left ( + ) 0 (List.map total l)
For the Lst
constructor, it can sometimes be useful to destructure the
list as well:
let rec has_if (e: s_exp) : bool = match e with | Sym _ -> false | Num _ -> false | Lst (Sym "if" :: _) -> true | Lst l -> List.exists has_if l
S-expressions in our compiler
For the purposes of this course, we have a small library for dealing with S-expressions.
HW1, out today, is designed to get you some practice working with this library.
The library defines the same s_exp
type listed above; it also defines
functions to convert strings to s-expressions.
If our OCaml project is set up correctly, we can use the library like this:
open S_exp let e1 = parse "(* (+ 2 6 4) 3)"
parse
takes in a string and returns the corresponding s-expression.
Later in the course, we will see how this function works.
The compiler we have developed can be modified to use s-expressions:
compile.ml
open S_exp let compile (program: s_exp): string = match program with | Num n -> String.concat "\n" [ "global entry"; "entry:"; Printf.sprintf "\tmov rax, %d" n; "\tret" ]
We only support numbers, so we haven't included the other cases. Our OCaml editor will complain about this program, saying something about "non-exhaustive pattern matching"; this means that there are some s-expressions that we haven't matched. We can "fix" this error like this:
compile.ml
open S_exp exception BadExpression of s_exp let compile (program: s_exp): string = match program with | Num n -> String.concat "\n" [ "global entry"; "entry:"; Printf.sprintf "\tmov rax, %d" n; "\tret" ] | e -> raise (BadExpression e)
Now we will get a somewhat-useful error if we try to compile something that isn't a number.