Binary operations
So far, the programs in our language have only had to deal with one
value at a time. That's quite intentional–by restricting our language in
this way we've been able to compile everything using only the rax
register! Today, that changes. Instead of dealing with one value, we're
going to introduce operations that deal with–get this–two values. As
it turns out, this is much more challenging!
Binary operations in the interpreter
…Or at least, it's much more challenging in the compiler. Binary operations in our interpreter really won't be very different from unary operations! First off, here are the operations we will support:
- (+ e1 e2) adds the values of the expressions e1 and e2. e1 and e2 should
  evaluate to numbers.
- (- e1 e2) subtracts the value of e2 from the value of e1. e1 and e2 should
  evaluate to numbers.
- (= e1 e2) evaluates to true if e1 and e2 evaluate to the same value and
  false otherwise.
- (< e1 e2) evaluates to true if e1 evaluates to a smaller number than e2 and
  false otherwise. e1 and e2 should evaluate to numbers.
Here are the cases we'll need to add to interp_exp:
interp.ml
```ocaml
let rec interp_exp exp =
  match exp with
  (* some cases elided... *)
  | Lst [Sym "+"; e1; e2] -> (
      match (interp_exp e1, interp_exp e2) with
      | Number n1, Number n2 -> Number (n1 + n2)
      | _ -> raise (BadExpression exp))
  | Lst [Sym "-"; e1; e2] -> (
      match (interp_exp e1, interp_exp e2) with
      | Number n1, Number n2 -> Number (n1 - n2)
      | _ -> raise (BadExpression exp))
  | Lst [Sym "="; e1; e2] -> Boolean (interp_exp e1 = interp_exp e2)
  | Lst [Sym "<"; e1; e2] -> (
      match (interp_exp e1, interp_exp e2) with
      | Number n1, Number n2 -> Boolean (n1 < n2)
      | _ -> raise (BadExpression exp))
```
Notice that this code enforces type-correctness: +, -, and < will only
work on numbers. Just as we've seen with unary operations and
conditionals, the interpreter is just relying on OCaml's implementations
of these features.
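If our interpreter has a string-based entry point (say, interp, which
parses a program and prints its value, like the one from previous
lectures), we'd expect interactions like these; the exact output format
may differ:

```
$ interp "(+ 1 2)";;
3
$ interp "(< 1 2)";;
true
$ interp "(+ 1 false)";;
Exception: BadExpression ...
```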
Binary operations in the compiler
Here's where it gets tricky.
Let's try a naive translation of the interpreter version of +
(as a reminder: right now the compiler, unlike the interpreter, does not
enforce type-correctness):
compile.ml
```ocaml
let rec compile_exp exp =
  match exp with
  (* some cases elided... *)
  | Lst [Sym "+"; e1; e2] ->
      compile_exp e1 @ compile_exp e2 @ ...
```
Remember: compile_exp emits instructions to compute the value of exp
and to store that value in rax. So by the time we want to add the two
values, e2 is going to be in rax and e1 is going to be lost! So,
we'll somehow need to "save" the value of e1. Here's an idea: we could
save it to a different register! x86-64 has 16 general-purpose
registers; let's use r8, written Reg R8 in our OCaml assembly library:
compile.ml
```ocaml
let rec compile_exp exp =
  match exp with
  (* some cases elided... *)
  | Lst [Sym "+"; e1; e2] ->
      compile_exp e1
      @ [Mov (Reg R8, Reg Rax)]
      @ compile_exp e2
      @ [Add (Reg Rax, Reg R8)]
```
Here we're compiling e1 into rax, saving rax into r8, compiling e2
into rax, then adding the results of the two expressions together.
This seems to work great!
```
$ compile_and_run "(+ 1 2)";;
3
$ compile_and_run "(+ (+ 1 2) 3)";;
6
$ compile_and_run "(+ 1 (+ 2 3))";;
7
```
Wait, what was that last result? Something's not right here. Let's look at the assembly we're producing:
```
entry:
	mov rax, 4
	mov r8, rax
	mov rax, 8
	mov r8, rax
	mov rax, 12
	add rax, r8
	add rax, r8
	ret
```
We're compiling (+ 1 (+ 2 3)) by first storing the runtime
representation of 1 in r8, then compiling the second argument to +.
But since the second argument is also a call to +, the first thing
it's going to do is overwrite the value in r8 (in this case, with the
runtime representation of 2).
We could try to get around this by using more registers. We could
imagine having our compiler take a list of registers it's not allowed to
use when compiling an expression–here, since r8 is being used to store
1, we couldn't use r8 when compiling (+ 2 3). If we had an
infinite number of registers, a scheme like this could work. But since
we only have 16, there are going to be expressions that we won't be able
to compile with that kind of scheme.
So we need someplace to store intermediate values during computation, where we won't run out of room. How about memory?
The stack
The region of memory that our program has available for temporary use during computations is called the stack. (Longer-lived values live in the heap, which we'll talk about in a few lectures.) We'll start with a simple model of this region of memory; we'll make this model more complex, and somewhat more accurate, when we talk about functions.
Imagine the stack as an array of cells, each of which has an
address. The bottom of our stack is at the highest address. When
our program starts executing, the register rsp
contains this
address. The memory cell at this address contains the function's
return address. We'll learn more about what that means later; for
now, just know that we shouldn't overwrite the data at that address.
The "next" memory cell in the stack–that is, the first cell that we
can write data into–is at (rsp - 8). Why -? Because the stack
grows "up", from higher addresses to lower addresses. rsp + 8
probably contains data used by the calling function. Why 8?
Because the word size on x86-64 is 8 bytes (64 bits). x86-64
memory addresses are 8 bytes; x86-64 registers are 8 bytes; all of
our program values are 8 bytes. So the stack looks like this:
| address  | data           |
|----------|----------------|
| …        | …              |
| rsp - 16 | unused         |
| rsp - 8  | unused         |
| rsp      | return address |
Accessing the stack from assembly
We've seen the mov instruction before–it lets us move immediate
data into registers, or move data between registers. It also lets us
move data between registers and memory. So, let's modify our
compiler to save the value of the first argument to + to memory
instead of r8.
compile.ml
```ocaml
let rec compile_exp exp =
  match exp with
  (* some cases elided... *)
  | Lst [Sym "+"; e1; e2] ->
      compile_exp e1
      @ [Mov (MemOffset (Reg Rsp, Imm (-8)), Reg Rax)]
      @ compile_exp e2
      @ [Mov (Reg R8, MemOffset (Reg Rsp, Imm (-8)))]
      @ [Add (Reg Rax, Reg R8)]
```
If we compile (+ 1 2)
now, we get this:
```
entry:
	mov rax, 4
	mov [rsp + -8], rax
	mov rax, 8
	mov r8, [rsp + -8]
	add rax, r8
	ret
```
Those square-bracketed expressions are how our assembly language
represents memory accesses. As we see, offsets into memory (of the
form <operand> + <operand>) can be used as operands to instructions
like mov and add.
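Incidentally, this is why Mov and Add can take a MemOffset argument: in
our OCaml assembly library, a memory offset is just another kind of
operand. Here's a rough sketch of what the relevant types might look
like (this is an assumption; the library's actual definitions may
differ in detail):

```ocaml
(* A sketch, not the library's exact definition: the constructors we've
   used so far suggest types roughly like these. *)
type register =
  | Rax
  | Rsp
  | R8
  (* ... and the other x86-64 registers ... *)

type operand =
  | Reg of register                 (* a register, e.g. rax *)
  | Imm of int                      (* an immediate constant *)
  | MemOffset of operand * operand  (* a memory access, displayed as [base + offset] *)
```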
What happens if we compile (+ 1 (+ 2 3)) now? We still have the
same problem we did before–2 is overwriting 1, this time at
[rsp - 8] instead of in r8:
```
entry:
	mov rax, 4
	mov [rsp + -8], rax
	mov rax, 8
	mov [rsp + -8], rax
	mov rax, 12
	mov r8, [rsp + -8]
	add rax, r8
	mov r8, [rsp + -8]
	add rax, r8
	ret
```
Now, though, we'll be able to fix this issue.
Tracking the stack index
Instead of storing the intermediate value 2 at [rsp - 8], the
compiler should store it at the next available stack address:
[rsp - 16]. So when we call compile_exp e2, we will need to let it
know that [rsp - 16] is the new first stack address.
We can implement this by adding an argument to compile_exp
representing the next available stack index:
compile.ml
```ocaml
let rec compile_exp (stack_index : int) exp = ...
```
Most of the time, this stack_index argument will remain unchanged
through recursive calls. But if we store something on the stack,
we'll need to update it. Right now, we need to do that in exactly
one place: that compile_exp e2 call. We'll modify our code to
store the intermediate value at [rsp + stack_index], and to
subtract 8 from the stack index for that recursive call:
compile.ml
```ocaml
let rec compile_exp (stack_index : int) exp =
  match exp with
  (* some cases elided... *)
  | Lst [Sym "+"; e1; e2] ->
      compile_exp stack_index e1
      @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)]
      @ compile_exp (stack_index - 8) e2
      @ [Mov (Reg R8, MemOffset (Reg Rsp, Imm stack_index))]
      @ [Add (Reg Rax, Reg R8)]
```
We now get the following code for (+ 1 (+ 2 3)):
```
entry:
	mov rax, 4
	mov [rsp + -8], rax
	mov rax, 8
	mov [rsp + -16], rax
	mov rax, 12
	mov r8, [rsp + -16]
	add rax, r8
	mov r8, [rsp + -8]
	add rax, r8
	ret
```
This now works great! We've successfully implemented addition.
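If we re-run the example that tripped us up before, we should now see
the right answer:

```
$ compile_and_run "(+ 1 (+ 2 3))";;
6
```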
Other binary operations
Our code for the other binary operations we support looks similar.
Note that since -
isn't commutative, we need to be careful about
the order of the first and second arguments!
compile.ml
```ocaml
let rec compile_exp (stack_index : int) exp =
  match exp with
  (* some cases elided... *)
  | Lst [Sym "-"; e1; e2] ->
      compile_exp stack_index e1
      @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)]
      @ compile_exp (stack_index - 8) e2
      @ [Mov (Reg R8, Reg Rax)]
      @ [Mov (Reg Rax, MemOffset (Reg Rsp, Imm stack_index))]
      @ [Sub (Reg Rax, Reg R8)]
  | Lst [Sym "="; e1; e2] ->
      compile_exp stack_index e1
      @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)]
      @ compile_exp (stack_index - 8) e2
      @ [Mov (Reg R8, MemOffset (Reg Rsp, Imm stack_index))]
      @ [Cmp (Reg Rax, Reg R8)]
      @ zf_to_bool
  | Lst [Sym "<"; e1; e2] ->
      compile_exp stack_index e1
      @ [Mov (MemOffset (Reg Rsp, Imm stack_index), Reg Rax)]
      @ compile_exp (stack_index - 8) e2
      @ [Mov (Reg R8, MemOffset (Reg Rsp, Imm stack_index))]
      @ [Cmp (Reg R8, Reg Rax)]
      @ lf_to_bool
```
< uses lf_to_bool, which calls setl instead of setz. setl
reads the SF and OF flags; after a comparison operation, it will
set its operand to 1 if the first comparison argument was strictly
less than the second.
```ocaml
let lf_to_bool : directive list =
  [ Mov (Reg Rax, Imm 0);
    Setl (Reg Rax);
    Shl (Reg Rax, Imm bool_shift);
    Or (Reg Rax, Imm bool_tag) ]
```
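For comparison, zf_to_bool (which the = case above reuses from our
earlier work on conditionals) should look just like this, except that
it uses setz, which reads the zero flag that cmp sets:

```ocaml
(* Assumed shape of zf_to_bool from earlier lectures: identical to
   lf_to_bool, but setz sets rax to 1 exactly when the compared
   values were equal. *)
let zf_to_bool : directive list =
  [ Mov (Reg Rax, Imm 0);
    Setz (Reg Rax);
    Shl (Reg Rax, Imm bool_shift);
    Or (Reg Rax, Imm bool_tag) ]
```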
A note about undefined behavior
What should this expression evaluate to?
(+ 1 false)
Our interpreter gives us the answer: just like a nonsense expression
like (hello csci1260), (+ 1 false) isn't part of our language,
so the interpreter raises an exception. What will our compiler do on
this program?
```
$ compile_and_run "(+ 1 false)";;
"BAD 35"
```
Our runtime indicates that we've produced a bad value (specifically, 35)–it doesn't correspond to anything in our tagging scheme. So, OK–the compiler and the interpreter both end up producing errors on this program.
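Where does 35 come from? The compiled add instruction just adds the two
runtime representations: 1 is tagged as 0b100 = 4 (that's the mov rax, 4
we saw in the assembly earlier), and false is tagged as 0b00011111 = 31
(more on this below). We can check the arithmetic in the OCaml toplevel:

```
# (1 lsl 2) + 0b00011111;;
- : int = 35
```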
How about this program?
(+ 32 false)
Our interpreter, of course, still throws an exception. But our compiler does something pretty weird:
```
$ compile_and_run "(+ 32 false)";;
"true"
```
Weird, right? It makes sense, though: since false is represented
as 0b00011111 and 32 is represented as 0b10000000, false + 32
is 0b10011111, the runtime representation of true.
So: is this a bug in our compiler? Maybe not! (+ 32 false)
isn't
a valid program in the language supported by our compiler. There are
lots of these invalid programs! Different ones result in different
things:
- Programs like (hello csci1260) result in an exception at compile time
- Programs like (+ 1 false) result in a runtime error
- Programs like (+ 32 false) result in a strange value
The behavior of our compiler on these programs is undefined. We can error out at compile time, error out at runtime, produce a reasonable-looking value, or do anything else. Some real-world programming languages include undefined behavior as part of the language standard; for instance, dereferencing a null pointer in C is undefined.
Many modern languages, however, eschew undefined behavior–as we have
just seen, it's quite confusing for programmers! A reasonable
specification for programs like (+ 1 true)
is that they should
result in errors. In a couple of weeks, we will see how to add
error-handling to our compiler and get rid of this strange behavior.
For now, we'll leave our compiler's behavior specified only on valid
expressions.