Homework 5 - Fun With files

Assignment Instructions

  1. Accept the assignment on GitHub Classroom here.

  2. Do the assignment 🐛.

  3. Upload the assignment on Gradescope. The most convenient way to do this is by uploading your assignment repo through the integrated GitHub submission method on Gradescope, but you may also upload a .zip of your repo.

Introduction

In this homework, you'll implement string support and file handling. You'll get more practice dealing with data on the heap and learn how to extend your compiler's functionality via the C runtime.

Strings

Implement support for strings. For now, this just means adding support for string literals (i.e., s-expressions built with the Str constructor). String literals are sequences of characters enclosed in double-quotes. Strings should be displayed using the same double-quote-enclosed representation when they are print-ed.

Escaping

Strings may contain characters with special meaning (for example, newlines and double-quotes). When displaying strings, the newline and double-quote special characters should be printed as \n and \", respectively, to ensure the resulting expression is well-formed. For instance, a string containing only a double-quote should be displayed as "\"" and not """.

You do not need to support special characters besides quote and newline.

Interpreter

Add string support to the interpreter by adding a constructor to value and extending the interpreter's display_value function to display strings. You may use OCaml's built-in string type for the underlying value, and you may use String.escaped to escape special characters.

Compiler

In the compiler, we'll represent strings like C does: null-terminated sequences of characters. (The null terminator is simply a byte with value 0 [in C, a char represented by '\0'] that follows the bytes that contains the string's characters.) The runtime value for a string should be a pointer to the first character of the sequence, tagged with 0b011.

Extend the runtime with support for displaying strings. In order to properly escape newlines and double-quotes, implement this as a loop over the string's characters, and check if each needs to be escaped.

There are multiple ways to store string values; for example, you could store string literals on the heap. In fact, we suggest that you do this when implementing input in the latter part of this assignment. However, to implement string literals -- that is, explicit strings like "abc" that appear in our Lisp programs directly -- we can use a more direct approach: the dq assembly directive.

Remember in homework 3 that you used the dq directive to "embed data in the executable". An executable is itself is a sequence of bytes, where most of those bytes represent machine code instructions. When the program is run, these bytes are loaded into memory, thus each byte of the executable has a memory address. When the assembler sees dq with an argument, it inserts the argument directly into the bytes of the executable that it generates. Normally the argument to dq is not itself an instruction, so we must make sure the processor will never try to execute these bytes of the executable. But if we label this location, we can find its address in memory using lea, and thus access its contents.

But there's a catch. Remember that when we started putting pairs on the heap, we had to make sure that every address we used to represent pairs was a multiple 8, so that we had room on the end for a three-bit tag. We need to do the same here, this time by using the align assembly directive. When the assembler sees align 8, it ensures that the next instruction will be placed at a memory location that is a multiple of 8. It fills the required amount of padding by inserting some number of no-op instructions into the sequence of machine code. So if we align 8 before storing a string value with dq, the string will be stored in the resulting program at an address whose last three bits are 0. As noted above, you will need to tag string values with 0b011.

For example, what would end up in rax if you wrote the following assembly code?

  lea rax, [my_label]
  jmp continue
  align 8
label my_label:
  dq 'its_friday'
label continue:

After executing this code, rax will contain the address of the string "its_friday". Think about what problems you might encounter if you move the line label my_label: above the line align 8 (hint: think about where the "junk" bytes generated by align will be placed in the generated machine code). Also, think about what problems you might encounter if you omit the jmp continue line.

In the compiler, the lea instruction above is represented by the LeaLabel constructor, label directives are represented by the Label constructor, and dq directives (as shown above) are represented by the DqString constructor.

Note: DqString automatically adds a null-terminator to the specified string data.

File I/O using channels

You're going to add a miniature version of OCaml's channel-based I/O to your language. First, you'll need to add support for channel types. Channels can be either input channels or output channels. Input channels can be read from; output channels can be written to.

The symbols stdin and stdout refer to the program's input and output. stdin is an input channel; stdout is an output channel.

Channels should be printed as <in-channel> or <out-channel>.

Now you'll implement some operations to open, close, and read and write via channels.

In the interpreter, these should be implemented using OCaml's built-in functions:

In the compiler, all of these operations will be implemented with C functions that you add to the runtime. You should define a C function for each operation (except for close-in and close-out, which should be implemented identically in the runtime, using the same C function).

By convention, C functions will look for their first argument in rdi, so you should use this register to pass arguments to open-in, etc.

Note: Our autograder relies on stdout, so do not try and close stdin/stdout. You can consider this undefined behavior. We will not evaluate your solution on whether or not you can close stdin or stdout!

Testing

For this assignment, you should be sure to check that the types of the inputs are as expected. Furthermore, this assignment adds a couple of testing features to help with file I/O.

Printing values

At the end of Lecture 12, we made a change to the behavior of our interpreter and our compiler runtime. By default, they will no longer print the result of evaluating a program. Interpreting (or compiling and running) the program (add1 10) will not produce any output. To replicate the old behavior, you can run the program (print (add1 10)), or you can output a string to stdout.

Providing program input

Your programs can now read from standard input, using both the read-num operator defined in class and the input operator you'll write on this assignment. For testing, you should provide inputs by writing .in files for each .lisp file that expects an input. For example, if we defined a program like this in examples/read-num.lisp:

(print (pair (read-num) (pair (read-num) (read-num))))

We could define its input in examples/read-num.in:

8
13
21

The testing system will provide this as the input to both the interpreter and the compiler.

If you are using the batch testing script index.in, you can write these input values as an optional fourth part of a test:

readnumpair|
(print (pair (read-num) (pair (read-num) (read-num))))|
(pair 40 (pair 35 112))|
40
35
112

The tmp directory

When testing reading/writing to files, it can be easy for state to get mixed up between tests (e.g. one test creates a file and writes some things to it, while another test expects the file to not exist). To prevent this situation, our test runner will create a temporary directory called tmp, and will empty this directory before each test, thus ensuring that every test starts with an empty tmp directory. You must use this directory when writing files in your tests (i.e. all paths passed to open-out must start with tmp/...). This restriction is checked when your tests are run by the autograder.

For example, to test writing a file you might use a test program like the following:

(let ((f (open-out "tmp/test.txt")))
  <do something with f>)

You may also add files to a data folder located in the root folder for this project. These files will be accessible by test programs, to allow you to test input channels separate from output channels. However, you may not write to these files for the reasons explained above.

For example, you might add the file data/test.txt with the contents hello, and then include the following program in examples/read-file.lisp:

(let ((f (open-in "data/test.txt")))
  (print (input f 4)))

Grammar

The grammar your new interpreter and compiler should support is as follows:

<expr> ::= <num>
         | <id>
+        | <string>
         | true
         | false
+        | stdin
+        | stdout
-        | ()
         | (<z-prim>)
         | (<un_prim> <expr>)
         | (<bin_prim> <expr> <expr>)
         | (if <expr> <expr> <expr>)
         | (let ((<id> <expr>) ...) <expr>)
         | (do <expr> <expr> ...)

<z-prim> ::= read-num | newline

<un_prim> ::= add1
            | sub1
            | zero?
            | num?
            | not
            | pair?
            | left
            | right
            | print
+           | open-in
+           | open-out
+           | close-in
+           | close-out

<bin_prim> ::= +
             | -
             | =
             | <
             | pair
+            | input
+            | output