The parser takes as input a Charme program string, and produces as output a nested list that encodes the structure of the input program. The first step is to break the input string into tokens; this is done by the
tokenize procedure defined in the previous section.
The next step is to take the list of tokens and produce a data structure that encodes the structure of the input program. Since the Charme language is built from simple parenthesized expressions, we can represent the parsed program as a list. But, unlike the list returned by
tokenize which is a flat list containing the tokens in order, the list returned by
parse is a structured list that may have lists (and lists of lists, etc.) as elements.
Charme's syntax is very simple, so the parser can be implemented by just breaking an expression into its components using the parentheses and whitespace. The parser needs to balance the open and close parentheses that enclose expressions. For example, if the input string is
the output of
tokenizer is the list:
The parser structures the tokens according to the program structure, producing a parse tree that encodes the structure of the input program. The parenthesis provide the program structure, so are removed from the parse tree. For the example, the resulting parse tree is:
['*', 'x', 'x'] ] ]
Here is the definition of
def parse_tokens(tokens, inner):
res = 
while len(tokens) > 0:
current = tokens.pop(0)
if current == '(':
res.append (parse_tokens(tokens, True))
elif current == ')':
if inner: return res
error('Unmatched close paren: ' + s)
error ('Unmatched open paren: ' + s)
return parse_tokens(tokenize(s), False)
The input to
parse is a string in the target language. The output is a list of the parenthesized expressions in the input. Here are some examples:
The parentheses are no longer included as tokens in the result, but their presence in the input string determines the structure of the result.
parse procedure implements a recursive descent parser. The main
parse procedure defines the
parse_tokens helper procedure and returns the result of calling it with inputs that are the result of tokenizing the input string and the Boolean literal
return parse_tokens(tokenize(s), False).
parse_tokens procedure takes two inputs:
tokens, a list of tokens (that results from the
tokenize procedure); and
inner, a Boolean that indicates whether the parser is inside a parenthesized expression. The value of
False for the initial call since the parser starts outside a parenthesized expression. All of the recursive calls result from encountering a
'(', so the value passed as
True for all the recursive calls.
The body of the
parse_tokens procedure initializes
res to an empty list that is used to store the result. Then, the
while statement iterates as long as the token list contains at least one element.
The first statement of the
while statement block assigns
pop method of the list takes a parameter that selects an element from the list. The selected element is returned as the result. The
pop method also mutates the list object by removing the selected element. So,
tokens.pop(0) returns the first element of the
tokens list and removes that element from the list. This is essential to the parser making progress: every time the
tokens.pop(0) expression is evaluated the number of elements in the token list is reduced by one.
current token is an open parenthesis,
parse_tokens is called recursively to parse the inner expression (that is, all the tokens until the matching close parenthesis). The result is a list of tokens, which is appended to the result. If the
current token is a close parenthesis, the behavior depends on whether or not the parser is parsing an inner expression. If it is inside an expression (that is, an open parenthesis has been encountered with no matching close parenthesis yet), the close parenthesis closes the inner expression, and the result is returned. If it is not in an inner expression, the close parenthesis has no matching open parenthesis so a parse error is reported.
else clause deals with all other tokens by appending them to the list.
The final if statement checks that the parser is not in an inner context when the input is finished. This would mean there was an open parenthesis without a corresponding close, so an error is reported. Otherwise, the list representing the parse tree is returned.