Homework 7 - Parsing

Assignment Instructions

  1. Accept the assignment on GitHub Classroom here.

  2. Do the assignment 🐫.

  3. Upload the assignment on Gradescope. The most convenient way to do this is by uploading your assignment repo through the integrated GitHub submission method on Gradescope, but you may also upload a .zip of your repo.

Introduction

In this homework, you'll implement a top-down predictive parser for an alternative syntax for our language. This syntax is called ML (Brown), because it bears a vague resemblance to ML-family languages like OCaml. We'll usually refer to it as MLB. Here's an example of an MLB program:

function add_up(a, b, c) =
  a + b = c

let x = 
  if 
    add_up(read_num(), read_num(), read_num())
  then 1
  else 2 
in
print(x)

Unlike the previous assignments, you won't be modifying either the interpreter or the compiler. We've provided an AST-based version of the HW6 compiler and interpreter, and functions to produce the AST from S-expressions. You'll write a parser that produces the same AST but that instead reads in MLB-format source code.

MLB syntax

Here's a grammar (like the ones we discussed in class) for MLB:

<program> ::= <defns> <expr>

<defns> ::=
  | epsilon
  | <defn> <defns>

<defn> ::=
  | FUNCTION ID LPAREN <params> EQ <expr>

<params> ::=
  | RPAREN
  | ID <rest-params>

<rest-params> ::=
  | RPAREN
  | COMMA ID <rest-params>

<expr> ::=
  | IF <expr> THEN <expr> ELSE <expr>
  | LET ID EQ <expr> IN <expr>
  | <seq>

<seq> ::=
  | <infix1> <rest-seq>

<rest-seq> ::=
  | epsilon
  | SEMICOLON <infix1> <rest-seq>

<infix1> ::=
  | <infix2> <infix1'>

<infix1'> ::=
  | epsilon
  | EQ <infix1>
  | LT <infix1>

<infix2> ::=
  | <term> <infix2'>

<infix2'> ::=
  | epsilon
  | PLUS <infix2>
  | MINUS <infix2>

<term> ::=
  | ID
  | ID LPAREN <args>
  | NUM
  | LPAREN <expr> RPAREN

<args> ::=
  | RPAREN
  | <expr> <rest-args>

<rest-args> ::=
  | RPAREN
  | COMMA <expr> <rest-args>

This grammar does not have any left-recursion or left-ambiguity (the only exception is in <term>, where you can easily handle the two ID cases with careful pattern-matching). We recommend writing a recursive-descent parser like the ones we developed in class:

The code

You'll write your parser in mlb_syntax/parser.ml. There is a tokenizer implemented in mlb_syntax/tokenizer.ml; it shouldn't be necessary to change it, but you can if you want to.

The AST you'll produce is defined in ast/ast.ml; it's quite similar to the AST we defined in class. A few hints for mapping the MLB grammar to the AST:

We've provided one other helper function: consume_token checks to see that the head of a token list is what you want it to be, returning the tail of the list if it is and raising an error otherwise. We've also provided a few parse_ functions to serve as a starting point for the parser. You shouldn't need to change the top-level parse_program function, but you'll need to fill in the bodies of parse_defns and parse_expr and add additional non-terminal parsing functions.

Testing

We've extended the tester to support programs in the new syntax. You can write MLB-formatted examples either by:

Note that in general, the interpreter and the compiler will give the same result on all of your programs! You'll probably want to write .out files, or include expected output in the .tsv file, to make sure your parser is actually working. These work exactly the same as with .lisp files.

On this homework more than on previous ones, it may be useful to run your functions in an OCaml shell. You can do that by running dune utop from the hw7 directory, then entering e.g.

> open Mlb_syntax.Tokenizer;;
> open Mlb_syntax.Parser;;
> tokenize "1 + 3" |> parse_program;;

A word on associativity

With the grammar specified above, the MLB expression 2 + 3 + 4 will parse to something like (in S-expression syntax): (+ 2 (+ 3 4).

This is a little different from what we'd usually expect: addition is generally defined to left-associative. Most languages parse that same expression to: (+ (+ 2 3) 4)

For addition, this doesn't really matter--since it's associative, those expressions evaluate to the same thing. This can lead to weird behavior on subtraction, though: the expression 10 - 3 - 2 should probably evaluate to 5, but if you implement the grammar as specified above it will instead evaluate to 9 (i.e., 10 - (3 - 2)).

If you finish your parser early, try to fix this! There's more than one way to do it, but one way to get started would be to take a look at the <seq> and <rest-seq> non-terminals, which are used to get a list of expressions. Could you do something similar to get a list of terms, then transform the list into an AST of the correct shape?

There's no extra credit available for doing this--it's just for "fun."