Homework 5 - Fun With files
Assignment Instructions
-
Accept the assignment on GitHub Classroom here.
-
Do the assignment 🐛.
-
Upload the assignment on Gradescope. The most convenient way to do this is by uploading your assignment repo through the integrated GitHub submission method on Gradescope, but you may also upload a .zip of your repo.
Introduction
In this homework, you'll implement string support and file handling. You'll get more practice dealing with data on the heap and learn how to extend your compiler's functionality via the C runtime.
Strings
Implement support for strings. For now, this just means adding support for
string literals (i.e., s-expressions built with the Str
constructor). String
literals are sequences of characters enclosed in double-quotes. Strings should
be displayed using the same double-quote-enclosed representation when they are
print
-ed.
Escaping
Strings may contain characters with special meaning (for example, newlines and
double-quotes). When displaying strings, the newline and double-quote special
characters should be printed as \n
and \"
, respectively, to ensure the
resulting expression is well-formed. For instance, a string containing only a
double-quote should be displayed as "\""
and not """
.
You do not need to support special characters besides quote and newline.
Interpreter
Add string support to the interpreter by adding a constructor to value
and
extending the interpreter's display_value
function to display strings. You may
use OCaml's built-in string
type for the underlying value, and you may use
String.escaped
to escape
special characters.
Compiler
In the compiler, we'll represent strings like C does: null-terminated sequences
of characters. (The null terminator is simply a byte with value 0
[in C, a
char
represented by '\0'
] that follows the bytes that contains the string's
characters.) The runtime value for a string should be a pointer to the first
character of the sequence, tagged with 0b011
.
Extend the runtime with support for displaying strings. In order to properly escape newlines and double-quotes, implement this as a loop over the string's characters, and check if each needs to be escaped.
There are multiple ways to store string values; for example, you could store string literals on the heap. In fact, we suggest that you do this when implementing input
in the latter part of this assignment. However, to implement string literals -- that is, explicit strings like "abc"
that appear in our Lisp programs directly -- we can use a more direct approach: the dq
assembly directive.
Remember in homework 3 that you used the dq
directive to "embed data in the executable". An executable is itself is a sequence of bytes, where most of those bytes represent machine code instructions. When the program is run, these bytes are loaded into memory, thus each byte of the executable has a memory address. When the assembler sees dq
with an argument, it inserts the argument directly into the bytes of the executable that it generates. Normally the argument to dq
is not itself an instruction, so we must make sure the processor will never try to execute these bytes of the executable. But if we label this location, we can find its address in memory using lea
, and thus access its contents.
But there's a catch. Remember that when we started putting pairs on the heap, we had to make sure that every address we used to represent pairs was a multiple 8, so that we had room on the end for a three-bit tag. We need to do the same here, this time by using the align
assembly directive. When the assembler sees align 8
, it ensures that the next instruction will be placed at a memory location that is a multiple of 8. It fills the required amount of padding by inserting some number of no-op instructions into the sequence of machine code. So if we align 8
before storing a string value with dq
, the string will be stored in the resulting program at an address whose last three bits are 0
. As noted above, you will need to tag string values with 0b011
.
For example, what would end up in rax
if you wrote the following assembly code?
lea rax, [my_label] jmp continue align 8 label my_label: dq 'its_friday' label continue:
After executing this code, rax
will contain the address of the string "its_friday"
. Think about what problems you might encounter if you move the line label my_label:
above the line align 8
(hint: think about where the "junk" bytes generated by align
will be placed in the generated machine code). Also, think about what problems you might encounter if you omit the jmp continue
line.
In the compiler, the lea
instruction above is represented by the LeaLabel
constructor, label
directives are represented by the Label
constructor, and dq
directives (as shown above) are represented by the DqString
constructor.
Note: DqString
automatically adds a null-terminator to the specified string data.
File I/O using channels
You're going to add a miniature version of OCaml's channel-based I/O to your language. First, you'll need to add support for channel types. Channels can be either input channels or output channels. Input channels can be read from; output channels can be written to.
The symbols stdin
and stdout
refer to the program's input and output.
stdin
is an input channel; stdout
is an output channel.
- In the interpreter, implement input channels with
in_channel
and output channels without_channel
. The AST elementsStdin
andStdout
should evaluate to the OCaml built-in valuesstdin
andstdout
, respectively. - In the compiler, implement both types of channels with the same mask,
0b111111111
. Input channels should have tag0b011111111
; output channels should have tag0b001111111
.stdin
should be represented at runtime as0
shifted left and tagged with the input-channel tag;stdout
should be represented as1
shifted left and tagged with the output channel tag. The0
and1
correspond to file numbers for the Unix standard input/output files.
Channels should be printed as <in-channel>
or <out-channel>
.
Now you'll implement some operations to open, close, and read and write via channels.
(open-in filename)
takes in a string representing a filename and opens it as an input channel, returning the channel(open-out filename)
takes in a string representing a filename and opens it as an output channel, returning the channel(close-in ch)
takes in an input channel and closes it, returningtrue
(close-out ch)
takes in an output channel and closes it, returningtrue
(input ch n)
readsn
bytes from input channelch
, returning a string of those bytes (plus a null terminator)(output ch s)
writes the (null-terminated) strings
out to output channelch
, returningtrue
In the interpreter, these should be implemented using OCaml's built-in functions:
open-in
should be implemented usingopen_in
open-out
should be implemented usingopen_out
close-in
should be implemented usingclose_in
close-out
should be implemented usingclose_out
input
should be implemented usingreally_input_string
output
should be implemented usingoutput_string
In the compiler, all of these operations will be implemented with C functions
that you add to the runtime. You should define a C function for each operation
(except for close-in
and close-out
, which should be implemented identically
in the runtime, using the same C function).
By convention, C functions will look for their first argument in rdi
, so you
should use this register to pass arguments to open-in
, etc.
- Your functions for
open-in
andopen-out
should use theopen_for_reading
andopen_for_writing
helper functions we have provided inruntime.c
. Both functions return a file descriptor as an integer. Turn this file descriptor into a channel by shifting it and tagging it with the correct channel tag. - Your function(s) for
close-in
andclose-out
should use theclose
C function, passing the file descriptor obtained fromopen_for_reading
oropen_for_writing
(remember to remove the tag and shift appropriately). Input and output channels can be handled exactly the same. - Your function for
input
should use theread_all
helper function we've provided inruntime.c
; this helper function takes care of reading the correct number of bytes, adding a null terminator, and handling errors. Sinceread_all
reads into a buffer, you'll need to pass in a pointer to the Lisp heap (in other words, you'll have to pass the heap pointer as an argument to your function). After you callread_all
, you'll need to update your heap pointer to reflect the space used for reading; you'll also need to make sure that the heap pointer remains a multiple of 8. We recommend doing this by returning the heap adjustment from your C function as an integer (taking care to make sure it's a multiple of 8), but there are other ways of doing it. output
should be implemented using thewrite_all
helper function we have provided inruntime.c
.
Note: Our autograder relies on stdout, so do not try and close stdin/stdout. You can consider this undefined behavior. We will not evaluate your solution on whether or not you can close stdin or stdout!
Testing
For this assignment, you should be sure to check that the types of the inputs are as expected. Furthermore, this assignment adds a couple of testing features to help with file I/O.
Printing values
At the end of Lecture 12, we made a change to the behavior of our interpreter and our compiler runtime.
By default, they will no longer print the result of evaluating a program.
Interpreting (or compiling and running) the program (add1 10)
will not produce any output.
To replicate the old behavior, you can run the program (print (add1 10))
,
or you can output a string to stdout
.
Providing program input
Your programs can now read from standard input, using both the read-num
operator defined in class and the input
operator you'll write on this
assignment. For testing, you should provide inputs by writing .in
files for
each .lisp
file that expects an input. For example, if we defined a program
like this in examples/read-num.lisp
:
(print (pair (read-num) (pair (read-num) (read-num))))
We could define its input in examples/read-num.in
:
8
13
21
The testing system will provide this as the input to both the interpreter and the compiler.
If you are using the batch testing script index.in
, you can write these input values as an optional fourth part of a test:
readnumpair|
(print (pair (read-num) (pair (read-num) (read-num))))|
(pair 40 (pair 35 112))|
40
35
112
The tmp
directory
When testing reading/writing to files, it can be easy for state to get mixed up
between tests (e.g. one test creates a file and writes some things to it, while
another test expects the file to not exist). To prevent this situation, our test
runner will create a temporary directory called tmp
, and will empty this
directory before each test, thus ensuring that every test starts with an empty
tmp
directory. You must use this directory when writing files in your tests
(i.e. all paths passed to open-out
must start with tmp/...
). This
restriction is checked when your tests are run by the autograder.
For example, to test writing a file you might use a test program like the following:
(let ((f (open-out "tmp/test.txt")))
<do something with f>)
You may also add files to a data
folder located in the root folder for this
project. These files will be accessible by test programs, to allow you to test
input channels separate from output channels. However, you may not write to these files for the reasons explained above.
For example, you might add the file data/test.txt
with the contents hello
,
and then include the following program in examples/read-file.lisp
:
(let ((f (open-in "data/test.txt")))
(print (input f 4)))
Grammar
The grammar your new interpreter and compiler should support is as follows:
<expr> ::= <num> | <id> + | <string> | true | false + | stdin + | stdout - | () | (<z-prim>) | (<un_prim> <expr>) | (<bin_prim> <expr> <expr>) | (if <expr> <expr> <expr>) | (let ((<id> <expr>) ...) <expr>) | (do <expr> <expr> ...) <z-prim> ::= read-num | newline <un_prim> ::= add1 | sub1 | zero? | num? | not | pair? | left | right | print + | open-in + | open-out + | close-in + | close-out <bin_prim> ::= + | - | = | < | pair + | input + | output