11.2 Parser

The parser takes as input a Charme program string, and produces as output a nested list that encodes the structure of the input program. The first step is to break the input string into tokens; this is done by the tokenize procedure defined in the previous section.

The next step is to take the list of tokens and produce a data structure that encodes the structure of the input program. Since the Charme language is built from simple parenthesized expressions, we can represent the parsed program as a list. But, unlike the list returned by tokenize which is a flat list containing the tokens in order, the list returned by parse is a structured list that may have lists (and lists of lists, etc.) as elements.

Charme's syntax is very simple, so the parser can be implemented by just breaking an expression into its components using the parentheses and whitespace. The parser needs to balance the open and close parentheses that enclose expressions. For example, if the input string is

(define square (lambda (x) (* x x)))

the output of tokenizer is the list:

['(', 'define', 'square', '(', 'lambda', '(', 'x', ')', '(', '*', 'x', 'x', ')', ')', ')']

The parser structures the tokens according to the program structure, producing a parse tree that encodes the structure of the input program. The parenthesis provide the program structure, so are removed from the parse tree. For the example, the resulting parse tree is:

['define',
 'square',
 [  'lambda',
    ['x'],
    ['*', 'x', 'x'] ] ]

Here is the definition of parse:

def parse(s):
    def parse_tokens(tokens, inner):
       res = []
       while len(tokens) > 0:
          current = tokens.pop(0)
          if current == '(':
             res.append (parse_tokens(tokens, True))
          elif current == ')':
             if inner: return res
             else:
                error('Unmatched close paren: ' + s)
                return None
          else:
             res.append(current)
       
       if inner:
          error ('Unmatched open paren: ' + s)
          return None
       else:
          return res

    return parse_tokens(tokenize(s), False)

The input to parse is a string in the target language. The output is a list of the parenthesized expressions in the input. Here are some examples:

The parentheses are no longer included as tokens in the result, but their presence in the input string determines the structure of the result.

The parse procedure implements a recursive descent parser. The main parse procedure defines the parse_tokens helper procedure and returns the result of calling it with inputs that are the result of tokenizing the input string and the Boolean literal False: return parse_tokens(tokenize(s), False).

The parse_tokens procedure takes two inputs: tokens, a list of tokens (that results from the tokenize procedure); and inner, a Boolean that indicates whether the parser is inside a parenthesized expression. The value of inner is False for the initial call since the parser starts outside a parenthesized expression. All of the recursive calls result from encountering a '(', so the value passed as inner is True for all the recursive calls.

The body of the parse_tokens procedure initializes res to an empty list that is used to store the result. Then, the while statement iterates as long as the token list contains at least one element.

The first statement of the while statement block assigns tokens.pop(0) to current. The pop method of the list takes a parameter that selects an element from the list. The selected element is returned as the result. The pop method also mutates the list object by removing the selected element. So, tokens.pop(0) returns the first element of the tokens list and removes that element from the list. This is essential to the parser making progress: every time the tokens.pop(0) expression is evaluated the number of elements in the token list is reduced by one.

If the current token is an open parenthesis, parse_tokens is called recursively to parse the inner expression (that is, all the tokens until the matching close parenthesis). The result is a list of tokens, which is appended to the result. If the current token is a close parenthesis, the behavior depends on whether or not the parser is parsing an inner expression. If it is inside an expression (that is, an open parenthesis has been encountered with no matching close parenthesis yet), the close parenthesis closes the inner expression, and the result is returned. If it is not in an inner expression, the close parenthesis has no matching open parenthesis so a parse error is reported.

The else clause deals with all other tokens by appending them to the list.

The final if statement checks that the parser is not in an inner context when the input is finished. This would mean there was an open parenthesis without a corresponding close, so an error is reported. Otherwise, the list representing the parse tree is returned.