What is a compiler?
Introduction
A compiler is a program that takes in source code written in one language (called the source language) and returns source code written in another language (called the target language). Here's how we might write this in quasi-mathematical notation:
compiler : source_program -> target_program
That :
is read "has type". The arrow represents a function type; in
this case, a function from programs written in the source language to
programs written in the target language.
What are source_program
and target_program
? They must be types
representing code in each programming language. For many or most
programming language, that type is ultimately string
; programmers
write programs as sequences of characters. So we could write something
like
compiler : string -> string
For this class, though, we'll use the previous definition. In some languages–for instance, block-based programming languages like Scratch–programmers usually don't write source code as strings. Also, the first definition is a bit more specific about what these strings should look like.
So, now that we have our definition: what are some compilers you've worked with? What are their types?
compiler |
: |
source_program |
-> |
target_program |
---|---|---|---|---|
javac |
Java | Java bytecode | ||
gcc |
C | Machine code | ||
… |
How is a compiler different from an interpreter?
In addition to compilers, you might have used interpreters before. Interpreters are closely related to compilers; indeed, in this course you'll develop a number of interpreters as well as a number of compilers. What's the type of an interpreter?
Like a compiler, an interpreter takes in program code. What does it produce?
Here's one possible type:
interpreter : source_program -> value
Rather than producing more code in some target language, as a compiler does, an interpreter interprets its input code and produces some value.
One very important interpreter for our purposes is the computer's CPU! Depending on your computer, this is probably an interpreter for x86-64 machine code. Machine code is the "native" language of the computer. It's often targeted by compilers, including the compilers we'll develop in this course.
This class
In this class we will be developing, testing, and discussing compilers–in fact, we'll write our first compiler in just a little while. Why does this class exist? Why should you take it?
One reason is that compilers are interesting! For me at least, one of the big appeals of computer science is that it blends elegant theory with practical utility, and I think this topic is a great example of that. On the theory side, we'll learn about formal notions of compiler correctness and algorithms for things like register allocation; on the practice side, we'll get our hands dirty writing real compilers for real systems.
On a related note, compilers are complex software systems. Like an operating systems class, a compilers class is an opportunity to deploy good software engineering and testing practices on decently large programs. We'll talk a lot in this course about testing in particular!
I'm a mathematician by training, and my research is in logic and verification.
Why am I teaching compilers? If we just think of compilers as functions
compile : source_program -> target_program
, our story is missing something: a good compiler
shouldn't "change" the program, in some sense. To make that precise we need a way to
assign meanings (semantics) to programs in our source and target languages,
and then we can make a relational claim about our inputs and outputs, something like
"for every source program p
, the meaning of compile p
is the same as the meaning of p
."
Even better, we might want to prove that this claim holds.
Semantics, relations, proof -- we're starting to sound like logicians.
And indeed, in the verification world, there's a line of research into producing
formally correct compilers. Time permitting, we'll talk a little about this
at the end of the course.
There are lots of big, well-known and much-used compilers out there in the world.
This course is not about those.
It's totally reasonable to want to take a compilers course to learn about
the intricate details and implementation of gcc
, javac
, etc.
If this is what you're looking for, though, this course might not be for you.
Similarly, compilers are sometimes seen as a heavy, techy, systems topic.
We'll unavoidably be writing some assembly code in this course.
But I am not much of a systems hacker myself,
and if you're hoping to see how to bootstrap a compiler up from nothing,
or dive headfirst into the complicated world of x86,
you may find yourself disappointed.
This course will take a somewhat higher-level perspective,
looking at features and design options that appear in real-world compilers
in perhaps slightly different forms.
Logistics
A few logistics points for now:
- The course website is at https://browncs1260.github.io/.
- Almost everything will live there rather than Canvas.
- Two (short!) assignments are due on Tuesday, both linked from
the course website:
- The first drill. There will be drills due every Sunday night. (Get them out of the way and do them after class on Wednesday!) These are short quizzes, graded on completion; the idea is to make sure everyone is on the same page regarding each week's class material.
- The first homework assignment, HW0. Homeworks will be due weekly or biweekly, almost always at 11:59pm Eastern Time on Wednesdays. HW0 is a short programming assignment in the OCaml programming language. We're not expecting you to have written any OCaml before this course; the point of this homework is to make sure everyone has the development environment set up and has started to learn the language. This homework will be graded on completion, but if you struggle with it, we ask that you attend the OCaml gearup session (next week, details very soon). We'll talk more about OCaml next week.
- Most of your grade for the course will be based on the homeworks. You'll also get some points for completing the drills and attending weekly lab sessions held by the TAs (labs start next week). The other component of your grade will be the midterm and final exams. These exams will be short, low-stress, in-class exams based on the drills and homeworks. You won't be asked to solve new problems or write new programs on these examinations.
A final note: if you want to follow along with the rest of today's lecture,
check out the class-compiler-2023 repository.
You can click <> Code
and create a new codespace to follow along in a web browser.
We'll generally use codespaces as our development environment for this class.
See out Environment setup guide.
Our first compiler
OK: let's write a compiler! Eventually, we'll write a compiler for a programming language with the following features:
- Arithmetic
- Conditionals (ifs)
- Variables
- Lists
- First-class functions
- …and more!
We'll tackle these and other features one at a time, slowly growing the language our compiler supports. We'll talk more about what this language will look like next time. For now, let's try to write a compiler for a very small subset of this language: the integers!
Here are some examples of programs written in this language:
42
10
-4
4000000000000
What should these programs do? Well, let's see what they do in Python:
$ python >>> 42 42 >>> 10 10
It looks like these programs should sort of just return themselves.
Our compiler is going to target x86-64 machine code. That way, we won't need any other programs in order to interpret the programs we produce. What should a program to print out a number look like?
Printing a number, the hard way
We can find out by writing such a program in C:
program.c
#include <stdio.h> #include <inttypes.h> int64_t entry() { return 4000000000000; } int main(int argc, char **argv) { printf("%" PRIi64, entry()); return 0; }
This program is a bit more complex than it needs to be for now.
We're separating out the entry
and main
functions because doing
so will be useful later.
If you haven't seen C before, that's OK–all this program is doing is
printing a string to the screen and then exiting. The reason we've
written it in C is that we already have a compiler from C to machine
code, gcc
. So, let's make sure our program works:
$ gcc -o program program.c $ ./program 4000000000000$
OK, it seems to do the right thing (it's just printing the number, not a newline). Let's try to take a look at the code it produced!
$ cat program # the [cat] command prints out the contents of a file
This command results in a lot of unreadable garbage. That's not very helpful! What's going on here? Some of that stuff is actually x86-64 machine code, in binary format. Other parts are sort of boilerplate that tells the operating system how to execute this program. We can get a more useful output like this:
$ gcc -S -masm=intel -m64 program.c
$ cat program.s
.file "program.c" .intel_syntax noprefix .text .globl entry .type entry, @function entry: .LFB0: .cfi_startproc endbr64 push rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 mov rbp, rsp .cfi_def_cfa_register 6 mov eax, 4000000000 pop rbp .cfi_def_cfa 7, 8 ret .cfi_endproc .LFE0: .size entry, .-entry .section .rodata .LC0: .string "%li" .text .globl main .type main, @function main: .LFB1: .cfi_startproc endbr64 push rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 mov rbp, rsp .cfi_def_cfa_register 6 sub rsp, 16 mov DWORD PTR -4[rbp], edi mov QWORD PTR -16[rbp], rsi mov eax, 0 call entry mov rsi, rax lea rax, .LC0[rip] mov rdi, rax mov eax, 0 call printf@PLT mov eax, 0 leave .cfi_def_cfa 7, 8 ret .cfi_endproc .LFE1: .size main, .-main .ident "GCC: (Ubuntu 11.2.0-19ubuntu1) 11.2.0" .section .note.GNU-stack,"",@progbits .section .note.gnu.property,"a" .align 8 .long 1f - 0f .long 4f - 1f .long 5 0: .string "GNU" 1: .align 8 .long 0xc0000002 .long 3f - 2f 2: .long 0x3 3: .align 8 4:
This is called assembly language. It's basically a human-readable form of machine-code. Translating between assembly language and machine code is a simple mechanical process; we call this assembling.
Building a runtime
Let's look at the code above. There are two big sections, starting
with entry
and main
.
(If you're following these notes on a Mac, you may see extra underscores.)
These correspond to the two
functions, entry
and main
, in our C code.
In order to write our compiler, we're going to replace the entry
function (we've called it that because it's the entry point to the
code our compiler is producing). We're going to replace it with code
that looks a lot like the code up there, but simpler. We're going to
split that code out into its own file, and that's what our compiler
will produce.
First, we'll modify our C program:
runtime.c
#include <stdio.h> #include <inttypes.h> extern int64_t entry(); int main(int argc, char **argv) { printf("%" PRIi64, entry()); return 0; }
This program is called the runtime for our compiler. The runtime will be included in every program our compiler produces. Here, it just handles calling our function and printing the result. Later we'll extend it to do more things.
Note that we're writing the runtime in C, but we don't have to–we could write it in assembly, or some other language that can produce machine code. C is a convenient language for this purpose, but it's certainly not the only choice! Because we've written the runtime in C, C's runtime gets included in our programs as well.
We can compile the runtime like this:
$ gcc -c runtime.c -o runtime.o
Back to our entry
function. We can write it in assembly language
like this:
program.s
global entry entry: mov rax, 4000000000000 ret
Line by line, this program:
- Declares that it's going to define one function, called
entry
- Starts the
entry
function - Writes the value 4000000000000 into the
rax
register. We'll talk more about registers later; for now, the important thing is that by convention in x86-64, the return value of a function goes inrax
. - Returns from the function.
We can build this program like this:
$ nasm program.s -f elf64 -o program.o
Finally, we can combine the runtime and the program:
$ gcc program.o runtime.o -o program $ ./program 4000000000000$
Our own compiler
OK, so now we know what our compiler needs to do. It needs to turn a "program" like
4000000000000
into an assembly-language program like
global entry entry: mov rax, 4000000000000 ret
Let's write a program to do this! We'll write our program in OCaml (a programming language we'll discuss in much more detail next time). Here's what it might look like.
(Note, if you're looking at the current state of the
class-compiler-2023 repository,
I'm working on this file in the lib
directory.)
lib/compile.ml
let compile (program: string): string = String.concat "\n" [ "global entry"; "entry:"; Printf.sprintf "\tmov rax, %s" program; "\tret" ]
This program just makes a list of our assembly-language lines and then glues them together with newlines.
We can see our compiler work if we open up an OCaml interaction
environment by running dune utop
:
>>> open Csci1260.Compile;; >>> print_endline (compile "4000000000000");; global entry entry: mov rax, 4000000000000 ret
We can glue everything together like this:
compile.ml
let compile (program: string) : string = String.concat "\n" ["global entry"; "entry:"; Printf.sprintf "\tmov rax, %s" program; "\tret"] let compile_to_file (program: string) : unit = let file = open_out "program.s" in output_string file (compile program); close_out file let compile_and_run (program: string) : string = compile_to_file program; ignore (Unix.system "nasm program.s -f elf64 -o program.o"); ignore (Unix.system "gcc program.o runtime.o -o program -z noexecstack"); let inp = Unix.open_process_in "./program" in let r = input_line inp in close_in inp; r
So–that's our first compiler. Next time, we'll talk more about the language we're trying to compile. We'll also do a quick introduction to OCaml.