Homework 5 - Fun With files

Assignment Instructions

Accept the assignment on GitHub Classroom here.
Do the assignment 🐛.
Upload the assignment on Gradescope. The most convenient way to do this is by uploading your assignment repo through the integrated GitHub submission method on Gradescope, but you may also upload a .zip of your repo.

Introduction

In this homework, you'll implement string support and file handling. You'll get more practice dealing with data on the heap and learn how to extend your compiler's functionality via the C runtime.

Strings

Implement support for strings. For now, this just means adding support for string literals (i.e., s-expressions built with the Str constructor). String literals are sequences of characters enclosed in double-quotes. Strings should be displayed using the same double-quote-enclosed representation when they are print-ed.

Escaping

Strings may contain characters with special meaning (for example, newlines and double-quotes). When displaying strings, the newline and double-quote special characters should be printed as \n and \", respectively, to ensure the resulting expression is well-formed. For instance, a string containing only a double-quote should be displayed as "\"" and not """.

You do not need to support special characters besides quote and newline.

Interpreter

Add string support to the interpreter by adding a constructor to value and extending the interpreter's display_value function to display strings. You may use OCaml's built-in string type for the underlying value, and you may use String.escaped to escape special characters.

Compiler

In the compiler, we'll represent strings like C does: null-terminated sequences of characters. (The null terminator is simply a byte with value 0 [in C, a char represented by '\0'] that follows the bytes that contains the string's characters.) The runtime value for a string should be a pointer to the first character of the sequence, tagged with 0b011.

Extend the runtime with support for displaying strings. In order to properly escape newlines and double-quotes, implement this as a loop over the string's characters, and check if each needs to be escaped.

There are multiple ways to store string values; for example, you could store string literals on the heap. In fact, we suggest that you do this when implementing input in the latter part of this assignment. However, to implement string literals -- that is, explicit strings like "abc" that appear in our Lisp programs directly -- we can use a more direct approach: the dq assembly directive.

Remember in homework 3 that you used the dq directive to "embed data in the executable". An executable is itself is a sequence of bytes, where most of those bytes represent machine code instructions. When the program is run, these bytes are loaded into memory, thus each byte of the executable has a memory address. When the assembler sees dq with an argument, it inserts the argument directly into the bytes of the executable that it generates. Normally the argument to dq is not itself an instruction, so we must make sure the processor will never try to execute these bytes of the executable. But if we label this location, we can find its address in memory using lea, and thus access its contents.

But there's a catch. Remember that when we started putting pairs on the heap, we had to make sure that every address we used to represent pairs was a multiple 8, so that we had room on the end for a three-bit tag. We need to do the same here, this time by using the align assembly directive. When the assembler sees align 8, it ensures that the next instruction will be placed at a memory location that is a multiple of 8. It fills the required amount of padding by inserting some number of no-op instructions into the sequence of machine code. So if we align 8 before storing a string value with dq, the string will be stored in the resulting program at an address whose last three bits are 0. As noted above, you will need to tag string values with 0b011.

For example, what would end up in rax if you wrote the following assembly code?

  lea rax, [my_label]
  jmp continue
  align 8
label my_label:
  dq 'its_friday'
label continue:

After executing this code, rax will contain the address of the string "its_friday". Think about what problems you might encounter if you move the line label my_label: above the line align 8 (hint: think about where the "junk" bytes generated by align will be placed in the generated machine code). Also, think about what problems you might encounter if you omit the jmp continue line.

In the compiler, the lea instruction above is represented by the LeaLabel constructor, label directives are represented by the Label constructor, and dq directives (as shown above) are represented by the DqString constructor.

Note: DqString automatically adds a null-terminator to the specified string data.

File I/O using channels

You're going to add a miniature version of OCaml's channel-based I/O to your language. First, you'll need to add support for channel types. Channels can be either input channels or output channels. Input channels can be read from; output channels can be written to.

The symbols stdin and stdout refer to the program's input and output. stdin is an input channel; stdout is an output channel.

In the interpreter, implement input channels with in_channel and output channels with out_channel. The AST elements Stdin and Stdout should evaluate to the OCaml built-in values stdin and stdout, respectively.
In the compiler, implement both types of channels with the same mask, 0b111111111. Input channels should have tag 0b011111111; output channels should have tag 0b001111111. stdin should be represented at runtime as 0 shifted left and tagged with the input-channel tag; stdout should be represented as 1 shifted left and tagged with the output channel tag. The 0 and 1 correspond to file numbers for the Unix standard input/output files.

Channels should be printed as <in-channel> or <out-channel>.

Now you'll implement some operations to open, close, and read and write via channels.

(open-in filename) takes in a string representing a filename and opens it as an input channel, returning the channel
(open-out filename) takes in a string representing a filename and opens it as an output channel, returning the channel
(close-in ch) takes in an input channel and closes it, returning true
(close-out ch) takes in an output channel and closes it, returning true
(input ch n) reads n bytes from input channel ch, returning a string of those bytes (plus a null terminator)
(output ch s) writes the (null-terminated) string s out to output channel ch, returning true

In the interpreter, these should be implemented using OCaml's built-in functions:

open-in should be implemented using open_in
open-out should be implemented using open_out
close-in should be implemented using close_in
close-out should be implemented using close_out
input should be implemented using really_input_string
output should be implemented using output_string

In the compiler, all of these operations will be implemented with C functions that you add to the runtime. You should define a C function for each operation (except for close-in and close-out, which should be implemented identically in the runtime, using the same C function).

By convention, C functions will look for their first argument in rdi, so you should use this register to pass arguments to open-in, etc.

Your functions for open-in and open-out should use the open_for_reading and open_for_writing helper functions we have provided in runtime.c. Both functions return a file descriptor as an integer. Turn this file descriptor into a channel by shifting it and tagging it with the correct channel tag.
Your function(s) for close-in and close-out should use the close C function, passing the file descriptor obtained from open_for_reading or open_for_writing (remember to remove the tag and shift appropriately). Input and output channels can be handled exactly the same.
Your function for input should use the read_all helper function we've provided in runtime.c; this helper function takes care of reading the correct number of bytes, adding a null terminator, and handling errors. Since read_all reads into a buffer, you'll need to pass in a pointer to the Lisp heap (in other words, you'll have to pass the heap pointer as an argument to your function). After you call read_all, you'll need to update your heap pointer to reflect the space used for reading; you'll also need to make sure that the heap pointer remains a multiple of 8. We recommend doing this by returning the heap adjustment from your C function as an integer (taking care to make sure it's a multiple of 8), but there are other ways of doing it.
output should be implemented using the write_all helper function we have provided in runtime.c.

Note: Our autograder relies on stdout, so do not try and close stdin/stdout. You can consider this undefined behavior. We will not evaluate your solution on whether or not you can close stdin or stdout!

Testing

For this assignment, you should be sure to check that the types of the inputs are as expected. Furthermore, this assignment adds a couple of testing features to help with file I/O.

Printing values

At the end of Lecture 12, we made a change to the behavior of our interpreter and our compiler runtime. By default, they will no longer print the result of evaluating a program. Interpreting (or compiling and running) the program (add1 10) will not produce any output. To replicate the old behavior, you can run the program (print (add1 10)), or you can output a string to stdout.

Providing program input

Your programs can now read from standard input, using both the read-num operator defined in class and the input operator you'll write on this assignment. For testing, you should provide inputs by writing .in files for each .lisp file that expects an input. For example, if we defined a program like this in examples/read-num.lisp:

(print (pair (read-num) (pair (read-num) (read-num))))

We could define its input in examples/read-num.in:

8
13
21

The testing system will provide this as the input to both the interpreter and the compiler.

If you are using the batch testing script index.in, you can write these input values as an optional fourth part of a test:

readnumpair|
(print (pair (read-num) (pair (read-num) (read-num))))|
(pair 40 (pair 35 112))|
40
35
112

The `tmp` directory

When testing reading/writing to files, it can be easy for state to get mixed up between tests (e.g. one test creates a file and writes some things to it, while another test expects the file to not exist). To prevent this situation, our test runner will create a temporary directory called tmp, and will empty this directory before each test, thus ensuring that every test starts with an empty tmp directory. You must use this directory when writing files in your tests (i.e. all paths passed to open-out must start with tmp/...). This restriction is checked when your tests are run by the autograder.

For example, to test writing a file you might use a test program like the following:

(let ((f (open-out "tmp/test.txt")))
  <do something with f>)

You may also add files to a data folder located in the root folder for this project. These files will be accessible by test programs, to allow you to test input channels separate from output channels. However, you may not write to these files for the reasons explained above.

For example, you might add the file data/test.txt with the contents hello, and then include the following program in examples/read-file.lisp:

(let ((f (open-in "data/test.txt")))
  (print (input f 4)))

Grammar

The grammar your new interpreter and compiler should support is as follows:

<expr> ::= <num>
         | <id>
+        | <string>
         | true
         | false
+        | stdin
+        | stdout
-        | ()
         | (<z-prim>)
         | (<un_prim> <expr>)
         | (<bin_prim> <expr> <expr>)
         | (if <expr> <expr> <expr>)
         | (let ((<id> <expr>) ...) <expr>)
         | (do <expr> <expr> ...)

<z-prim> ::= read-num | newline

<un_prim> ::= add1
            | sub1
            | zero?
            | num?
            | not
            | pair?
            | left
            | right
            | print
+           | open-in
+           | open-out
+           | close-in
+           | close-out

<bin_prim> ::= +
             | -
             | =
             | <
             | pair
+            | input
+            | output