What is a compiler?

Introduction

A compiler is a program that takes in source code written in one language (called the source language) and returns source code written in another language (called the target language). Here's how we might write this in quasi-mathematical notation:

compiler : source_program -> target_program

That : is read "has type". The arrow represents a function type; in this case, a function from programs written in the source language to programs written in the target language.

What are source_program and target_program? They must be types representing code in each programming language. For many or most programming language, that type is ultimately string; programmers write programs as sequences of characters. So we could write something like

compiler : string -> string

For this class, though, we'll use the previous definition. In some languages–for instance, block-based programming languages like Scratch–programmers usually don't write source code as strings. Also, the first definition is a bit more specific about what these strings should look like.

So, now that we have our definition: what are some compilers you've worked with? What are their types?

compiler : source_program -> target_program
javac Java Java bytecode
gcc C Machine code

How is a compiler different from an interpreter?

In addition to compilers, you might have used interpreters before. Interpreters are closely related to compilers; indeed, in this course you'll develop a number of interpreters as well as a number of compilers. What's the type of an interpreter?

Like a compiler, an interpreter takes in program code. What does it produce?

Here's one possible type:

interpreter : source_program -> value

Rather than producing more code in some target language, as a compiler does, an interpreter interprets its input code and produces some value.

One very important interpreter for our purposes is the computer's CPU! Depending on your computer, this is probably an interpreter for x86-64 machine code. Machine code is the "native" language of the computer. It's often targeted by compilers, including the compilers we'll develop in this course.

This class

In this class we will be developing, testing, and discussing compilers–in fact, we'll write our first compiler in just a little while. Why does this class exist? Why should you take it?

One reason is that compilers are interesting! For me at least, one of the big appeals of computer science is that it blends elegant theory with practical utility, and I think this topic is a great example of that. On the theory side, we'll learn about formal notions of compiler correctness and algorithms for things like register allocation; on the practice side, we'll get our hands dirty writing real compilers for real systems.

On a related note, compilers are complex software systems. Like an operating systems class, a compilers class is an opportunity to deploy good software engineering and testing practices on decently large programs. We'll talk a lot in this course about testing in particular!

I'm a mathematician by training, and my research is in logic and verification. Why am I teaching compilers? If we just think of compilers as functions compile : source_program -> target_program, our story is missing something: a good compiler shouldn't "change" the program, in some sense. To make that precise we need a way to assign meanings (semantics) to programs in our source and target languages, and then we can make a relational claim about our inputs and outputs, something like "for every source program p, the meaning of compile p is the same as the meaning of p." Even better, we might want to prove that this claim holds. Semantics, relations, proof -- we're starting to sound like logicians. And indeed, in the verification world, there's a line of research into producing formally correct compilers. Time permitting, we'll talk a little about this at the end of the course.

There are lots of big, well-known and much-used compilers out there in the world. This course is not about those. It's totally reasonable to want to take a compilers course to learn about the intricate details and implementation of gcc, javac, etc. If this is what you're looking for, though, this course might not be for you. Similarly, compilers are sometimes seen as a heavy, techy, systems topic. We'll unavoidably be writing some assembly code in this course. But I am not much of a systems hacker myself, and if you're hoping to see how to bootstrap a compiler up from nothing, or dive headfirst into the complicated world of x86, you may find yourself disappointed. This course will take a somewhat higher-level perspective, looking at features and design options that appear in real-world compilers in perhaps slightly different forms.

Logistics

A few logistics points for now:

Our first compiler

OK: let's write a compiler! Eventually, we'll write a compiler for a programming language with the following features:

We'll tackle these and other features one at a time, slowly growing the language our compiler supports. We'll talk more about what this language will look like next time. For now, let's try to write a compiler for a very small subset of this language: the integers!

Here are some examples of programs written in this language:

42
10
-4
4000000000000

What should these programs do? Well, let's see what they do in Python:

$ python
>>> 42
42
>>> 10
10

It looks like these programs should sort of just return themselves.

Our compiler is going to target x86-64 machine code. That way, we won't need any other programs in order to interpret the programs we produce. What should a program to print out a number look like?

Printing a number, the hard way

We can find out by writing such a program in C:

program.c

#include <stdio.h>
#include <inttypes.h>

int64_t entry() {
    return 4000000000000;
}


int main(int argc, char **argv) {
    printf("%" PRIi64, entry());
    return 0;
}

This program is a bit more complex than it needs to be for now. We're separating out the entry and main functions because doing so will be useful later.

If you haven't seen C before, that's OK–all this program is doing is printing a string to the screen and then exiting. The reason we've written it in C is that we already have a compiler from C to machine code, gcc. So, let's make sure our program works:

$ gcc -o program program.c
$ ./printnum
4000000000000$

OK, it seems to do the right thing (it's just printing the number, not a newline). Let's try to take a look at the code it produced!

$ cat program # the [cat] command prints out the contents of a file

This command results in a lot of unreadable garbage. That's not very helpful! What's going on here? Some of that stuff is actually x86-64 machine code, in binary format. Other parts are sort of boilerplate that tells the operating system how to execute this program. We can get a more useful output like this:

$ gcc -S -masm=intel -m64 program.c
$ cat program.s
	.file	"program.c"
	.intel_syntax noprefix
	.text
	.globl	entry
	.type	entry, @function
entry:
.LFB0:
	.cfi_startproc
	endbr64
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	mov	rbp, rsp
	.cfi_def_cfa_register 6
	mov	eax, 4000000000
	pop	rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	entry, .-entry
	.section	.rodata
.LC0:
	.string	"%li"
	.text
	.globl	main
	.type	main, @function
main:
.LFB1:
	.cfi_startproc
	endbr64
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	mov	rbp, rsp
	.cfi_def_cfa_register 6
	sub	rsp, 16
	mov	DWORD PTR -4[rbp], edi
	mov	QWORD PTR -16[rbp], rsi
	mov	eax, 0
	call	entry
	mov	rsi, rax
	lea	rax, .LC0[rip]
	mov	rdi, rax
	mov	eax, 0
	call	printf@PLT
	mov	eax, 0
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE1:
	.size	main, .-main
	.ident	"GCC: (Ubuntu 11.2.0-19ubuntu1) 11.2.0"
	.section	.note.GNU-stack,"",@progbits
	.section	.note.gnu.property,"a"
	.align 8
	.long	1f - 0f
	.long	4f - 1f
	.long	5
0:
	.string	"GNU"
1:
	.align 8
	.long	0xc0000002
	.long	3f - 2f
2:
	.long	0x3
3:
	.align 8
4:

This is called assembly language. It's basically a human-readable form of machine-code. Translating between assembly language and machine code is a simple mechanical process; we call this assembling.

Building a runtime

Let's look at the code above. There are two big sections, starting with entry and main. (If you're following these notes on a Mac, you may see extra underscores.) These correspond to the two functions, entry and main, in our C code.

In order to write our compiler, we're going to replace the entry function (we've called it that because it's the entry point to the code our compiler is producing). We're going to replace it with code that looks a lot like the code up there, but simpler. We're going to split that code out into its own file, and that's what our compiler will produce.

First, we'll modify our C program:

runtime.c

#include <stdio.h>
#include <inttypes.h>

extern int64_t entry();

int main(int argc, char **argv) {
    printf("%" PRIi64, entry());
    return 0;
}

This program is called the runtime for our compiler. The runtime will be included in every program our compiler produces. Here, it just handles calling our function and printing the result. Later we'll extend it to do more things.

Note that we're writing the runtime in C, but we don't have to–we could write it in assembly, or some other language that can produce machine code. C is a convenient language for this purpose, but it's certainly not the only choice! Because we've written the runtime in C, C's runtime gets included in our programs as well.

We can compile the runtime like this:

$ gcc -c runtime.c -o runtime.o

Back to our entry function. We can write it in assembly language like this:

program.s

global entry
entry:
        mov rax, 4000000000000
        ret        

Line by line, this program:

  1. Declares that it's going to define one function, called entry
  2. Starts the entry function
  3. Writes the value 4000000000000 into the rax register. We'll talk more about registers later; for now, the important thing is that by convention in x86-64, the return value of a function goes in rax.
  4. Returns from the function.

We can build this program like this:

$ nasm program.s -f elf64 -o program.o # change "-f elf64" to  "-f macho64" if you're on a Mac

Finally, we can combine the runtime and the program:

$ gcc program.o runtime.o -o program
$ ./program
4000000000000$ 

Our own compiler

OK, so now we know what our compiler needs to do. It needs to turn a "program" like

4000000000000

into an assembly-language program like

global entry
entry:
        mov rax, 4000000000000
        ret        

Let's write a program to do this! We'll write our program in OCaml (a programming language we'll discuss in much more detail next time). Here's what it might look like:

compile.ml

let compile (program: string): string =
    String.concat "\n" 
    [ "global entry"; 
    "entry:"; 
    Printf.sprintf "\tmov rax, %s" program;
    "\tret" ]

This program just makes a list of our assembly-language lines and then glues them together with newlines.

We can see our compiler work if we open up an OCaml interaction environment by running utop:

>>> #use "compile.ml";;

>>> print_endline (compile "4000000000000");;
global entry
entry:
        mov rax, 4000000000000
        ret        

We can glue everything together like this:

compile.ml

let compile (program: string) : string =
    String.concat "\n" 
    ["global entry";
    "entry:";
    Printf.sprintf "\tmov rax, %s" program;
    "\tret"]

let compile_to_file (program: string): unit =
    let file = open_out "program.s" in
    output_string file (compile program);
    close_out file

let compile_and_run (program: string): string =
    compile_to_file program;
    ignore (Unix.system "nasm program.s -f elf64 -o program.o");
    ignore (Unix.system "gcc program.o runtime.o -o program");
    let inp = Unix.open_process_in "./program" in
    let r = input_line inp in
    close_in inp; r

So–that's our first compiler. Next time, we'll talk more about the language we're trying to compile. We'll also do a quick introduction to OCaml.