2.4 Replacement grammars

Another way to define a language is to use a grammar. This is the most common way languages are defined by computer scientists today, and the way we will use for the rest of this book.

A grammar is a set of rules for generating all strings in the language. We use the Backus-Naur Form (BNF) notation to define a grammar. BNF grammars are exactly as powerful as recursive transition networks (Exploration 2.1 explains what this means and why it is the case), but easier to write down.

BNF was invented by John Backus in the late 1950s. Backus led efforts at IBM to define and implement Fortran, the first widely used programming language. Fortran enabled computer programs to be written in a language more like familiar algebraic formulas than low-level machine instructions, enabling programs to be written more quickly. In defining the Fortran language, Backus and his team used ad hoc English descriptions to define the language.

John Backus

These ad hoc descriptions were often misinterpreted, motivating the need for a more precise way of defining a language.

Rules in a Backus-Naur Form grammar have the form:

nonterminal ::$\Rightarrow$ replacement

I flunked out every year. I never studied. I hated studying. I was just goofing around. It had the delightful consequence that every year I went to summer school in New Hampshire where I spent the summer sailing and having a nice time. John Backus

The left side of a rule is always a single symbol, known as a nonterminal since it can never appear in the final generated string. The right side of a rule contains one or more symbols. These symbols may include nonterminals, which will be replaced using replacement rules before generating the final string. They may also be terminals, which are output symbols that never appear as the left side of a rule. When we describe grammars, we use italics to represent nonterminal symbols, and bold to represent terminal symbols. The terminals are the primitives in the language; the grammar rules are its means of combination.

We can generate a string in the language described by a replacement grammar by starting from a designated start symbol (e.g., sentence), and at each step selecting a nonterminal in the working string, and replacing it with the right side of a replacement rule whose left side matches the nonterminal. Wherever we find a nonterminal on the left side of a rule, we can replace it with what appears on the right side of any rule where that nonterminal matches the left side. A string is generated once there are no nonterminals remaining.

Here is an example BNF grammar (that describes the same language as the RTN in Figure 2.1):

1. Sentence :: $\Rightarrow$ Noun Verb
2. Noun :: $\Rightarrow$ Alice
3. Noun :: $\Rightarrow$ Bob
4. Verb :: $\Rightarrow$ Jumps
5. Verb :: $\Rightarrow$ runs

Starting from Sentence, the grammar can generate four sentences: “Alice jumps”, “Alice runs”, “Bob jumps”, and “Bob runs”.

A derivation shows how a grammar generates a given string. Here is the derivation of “Alice runs”:

Sentence :: $\Rightarrow$ $\underline{\text{Noun}}$ Verb using Rule 1
:: $\Rightarrow$ Alice Verb replacing Noun using Rule 2
:: $\Rightarrow$ Alice runs replacing Verb using Rule 5

We can represent a grammar derivation as a tree, where the root of the tree is the starting nonterminal (Sentence in this case), and the leaves of the tree are the terminals that form the derived sentence. Such a tree is known as a parse tree. Here is the parse tree for the derivation of “Alice runs”:

BNF grammars can be more compact than just listing strings in the language since a grammar can have many replacements for each nonterminal. For example, adding the rule, Noun ::$\Rightarrow $ Colleen, to the grammar adds two new strings (“Colleen runs” and “Colleen jumps”) to the language.

Recursive Grammars.recursive grammarThe real power of BNF as a compact notation for describing languages, though, comes once we start adding recursive rules to our grammar. A grammar is recursive if the grammar contains a nonterminal that can produce a production that contains itself.

Suppose we add the rule,

Sentence ::$\Rightarrow $ Sentence and Sentence

to our example grammar. Now, how many sentences can we generate?

Infinitely many! This grammar describes the same language as the RTN in Figure 2.2. It can generate "Alice runs and Bob jumps" and "Alice runs and Bob jumps and Alice runs" and sentences with any number of repetitions of "Alice runs". This is very powerful: by using recursive rules a compact grammar can be used to define a language containing infinitely many strings.

Example 2.1: Whole Numbers

This grammar defines the language of the whole numbers (0, 1,$\ldots $) with leading zeros allowed:

Number :: $\Rightarrow$ Digit MoreDigits
MoreDigits :: $\Rightarrow$
MoreDigits :: $\Rightarrow$ Number
Digit :: $\Rightarrow$ 0
Digit :: $\Rightarrow$ 1
Digit :: $\Rightarrow$ 2
Digit :: $\Rightarrow$ 3
Digit :: $\Rightarrow$ 4
Digit :: $\Rightarrow$ 5
Digit :: $\Rightarrow$ 6
Digit :: $\Rightarrow$ 7
Digit :: $\Rightarrow$ 8
Digit :: $\Rightarrow$ 9

Here is the parse tree for a derivation of 37 from Number:

Circular vs. Recursive Definitions. The second rule means we can replace MoreDigits with nothing. This is sometimes written as $\epsilon $ to make it clear that the replacement is empty: MoreDigits ::$\Rightarrow $ $\epsilon $.

This is a very important rule in the grammar—without it no strings could be generated; with it infinitely many strings can be generated. The key is that we can only produce a string when all nonterminals in the string have been replaced with terminals. Without the MoreDigits ::$\Rightarrow $ $\epsilon $ rule, the only rule we would have with MoreDigits on the left side is the third rule: MoreDigits::$\Rightarrow $ Number.

The only rule we have with Number on the left side is the first rule, which replaces Number with Digit MoreDigits. Every time we follow this rule, we replace MoreDigits with Digit MoreDigits. We can produce as many Digits as we want, but without the MoreDigits ::$\Rightarrow $ $\epsilon $ rule we can never stop.

This is the difference between a circular definition, and a recursive definition. Without the stopping rule, MoreDigits would be defined in a circular way. There is no way to start with MoreDigits and generate a production that does not contain MoreDigits (or a nonterminal that eventually must produce MoreDigits). With the MoreDigits ::$\Rightarrow $ $\epsilon $ rule, however, we have a way to produce something terminal from MoreDigits. This is known as a base case — a rule that turns an otherwise circular definition into a meaningful, recursive definition.

Condensed Notation. It is common to have many grammar rules with the same left side nonterminal. For example, the whole numbers grammar has ten rules with Digit on the left side to produce the ten terminal digits. Each of these is an alternative rule that can be used when the production string contains the nonterminal Digit. A compact notation for these types of rules is to use the vertical bar ($\mid $) to separate alternative replacements. For example, we could write the ten Digit rules compactly as:

Digit ::$\Rightarrow $ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Exercise 2.11.  Suppose we replaced the first rule (Number ::$\Rightarrow $ Digit MoreDigits) in the whole numbers grammar with: Number ::$\Rightarrow $ MoreDigits Digit.

  1. How does this change the parse tree for the derivation of 37? Draw the parse tree that results from the new grammar.

  2. Does this change the language? Either show some string that is in the language defined by the modified grammar but not in the original language (or vice versa), or argue that both grammars generate the same strings.

Exercise 2.12.  The grammar for whole numbers we defined allows strings with non-standard leading zeros such as “000” and “00005”. Devise a grammar that produces all whole numbers (including “0”), but no strings with unnecessary leading zeros.

Exercise 2.13.  Define a BNF grammar that describes the language of decimal numbers (the language should include 3.14159, 0.423, and 1120 but not 1.2.3).

Exercise 2.14.  The BNF grammar below (extracted from Paul Mockapetris, Domain Names - Implementation and Specification, IETF RFC 1035) describes the language of domain names on the Internet.

Domain :: $\Rightarrow$ SubDomainList
SubDomainList :: $\Rightarrow$ Label $\mid $ SubDomainList $\mid $ . Label
Label :: $\Rightarrow$ Letter MoreLetters
MoreLetters :: $\Rightarrow$ LetterHyphens LetterDigit $\mid $ $\epsilon$
LetterHyphens :: $\Rightarrow$ LDHyphen $\mid $ LDHyphen LetterHyphens $\mid $ $\epsilon $
LDHyphen :: $\Rightarrow$ LetterDigit LetterDigit $\mid $ -
LetterDigit :: $\Rightarrow$ Letter Digit
Letter :: $\Rightarrow$ A $\mid$ B $\mid$ ... $\mid$ Z $\mid$ a $\mid$ b $\mid$ ... $\mid$ z
Digit :: $\Rightarrow$ 0 $\mid$ 1 $\mid$ 2 $\mid$ 3 $\mid$ 4 $\mid$ 5 $\mid$ 6 $\mid$ 7 $\mid$ 8 $\mid$ 9
  1. Show a derivation for www.virginia.edu in the grammar.

  2. According to the grammar, which of the following are valid domain names: (1) tj, (2) a.-b.c, (3) a-a.b-b.c-c, (4) a.g.r.e.a.t.d.o.m.a.i.n-.

Exploration 2.1: Power of Language Systems

Section 2.4 claimed that recursive transition networks and BNF grammars are equally powerful. What does it mean to say two systems are equally powerful?

A language description mechanism is used to define a set of strings comprising a language. Hence, the power of a language description mechanism is determined by the set of languages it can define.

One approach to measure the power of language description mechanism would be to count the number of languages that it can define. Even the simplest mechanisms can define infinitely many languages, however, so just counting the number of languages does not distinguish well between the different language description mechanisms. Both RTNs and BNFs can describe infinitely many different languages. We can always add a new edge to an RTN to increase the number of strings in the language, or add a new replacement rule to a BNF that replaces a nonterminal with a new terminal symbol.

Instead, we need to consider the set of languages that each mechanism can define. A system $A$ is more powerful that another system $B$ if we can use $A$ to define every language that can be defined by $B$, and there is some language $L$ that can be defined using $A$ that cannot be defined using $B$. This matches our intuitive interpretation of more powerful — $A$ is more powerful than $B$ if it can do everything $B$ can do and more.

The diagrams in Figure 2.6 show three possible scenarios. In the leftmost picture, the set of languages that can be defined by $B$ is a proper subset of the set of languages that can be defined by $A$. Hence, $A$ is more powerful than $B$. In the center picture, the sets are equal. This means every language that can be defined by $A$ can also be defined by $B$, and every language that can be defined by $B$ can also be defined by $A$, and the systems are equally powerful. In the rightmost picture, there are some elements of $A$ that are not elements of $B$, but there are also some elements of $B$ that are not elements of $A$. This means we cannot say either one is more powerful; $A$ can do some things $B$ cannot do, and $B$ can do some things $A$ cannot do.

Figure 2.6: System power relationships.

To determine the relationship between RTNs and BNFs we need to understand if there are languages that can be defined by a BNF that cannot be defined by a RTN and if there are languages that can be defined by a RTN that cannot be defined by an BNF. We will show only the first part of the proof here, and leave the second part as an exercise.

For the first part, we prove that there are no languages that can be defined by a BNF that cannot be defined by an RTN. Equivalently, every language that can be defined by a BNF grammar has a corresponding RTN. Since there are infinitely many languages that can be defined by BNF grammars, we cannot prove this by enumerating each language and showing its corresponding RTN. Instead, we use a proof technique commonly used in computer science: proof by construction. We show an algorithm that given any BNF grammar constructs an RTN that defines the same language as the input BNF grammar.

Our strategy is to construct a subnetwork corresponding to each nonterminal. For each rule where the nonterminal is on the left side, the right hand side is converted to a path through that node’s subnetwork.

Before presenting the general construction algorithm, we illustrate the approach with the example BNF grammar from Example 2.1:

Number :: $\Rightarrow$ Digit MoreDigits
MoreDigits :: $\Rightarrow$ $\epsilon $
MoreDigits :: $\Rightarrow$ Number
Digit :: $\Rightarrow$ 0 $\mid$ 1 $\mid$ 2 $\mid$ 3 $\mid$ 4 $\mid$ 5 $\mid$ 6 $\mid$ 7 $\mid$ 8 $\mid$ 9

The grammar has three nonterminals: Number, Digit, and MoreDigits. For each nonterminal, we construct a subnetwork by first creating two nodes corresponding to the start and end of the subnetwork for the nonterminal. We make StartNumber the start node for the RTN since Number is the starting nonterminal for the grammar.

Next, we need to add edges to the RTN corresponding to the production rules in the grammar. The first rule indicates that Number can be replaced by Digit MoreDigits. To make the corresponding RTN, we need to introduce an intermediate node since each RTN edge can only contain one label. We need to traverse two edges, with labels StartDigit and StartMoreDigits between the StartNumber and EndNumber nodes. The resulting partial RTN is shown in Figure 2.7.

Figure 2.7: Converting the Number productions to an RTN.

For the MoreDigits nonterminal there are two productions. The first means MoreDigits can be replaced with nothing. In an RTN, we cannot have edges with unlabeled outputs. So, the equivalent of outputting nothing is to turn StartMoreDigits into a final node. The second production replaces MoreDigits with Number. We do this in the RTN by adding an edge between StartMoreDigits and EndMoreDigits labeled with Number, as shown in Figure 2.8.

Figure 2.8: Converting the MoreDigits productions to an RTN.

Finally, we convert the ten Digit productions. For each rule, we add an edge between StartDigit and EndDigit labeled with the digit terminal, as shown in Figure 2.9.

Figure 2.9: Converting the Digit productions to an RTN.

This example illustrates that it is possible to convert a particular grammar to an RTN. For a general proof, we present a general an algorithm that can be used to do the same conversion for any BNF:

  1. For each nonterminal X in the grammar, construct two nodes, StartX and EndX, where EndX is a final node. Make the node StartS the start node of the RTN, where S is the start nonterminal of the grammar.

  2. For each rule in the grammar, add a corresponding path through the RTN. All BNF rules have the form X ::$\Rightarrow$ replacement where X is a nonterminal in the grammar and replacement is a sequence of zero or more terminals and nonterminals: $\left[ R_0, R_1, \ldots , R_ n \right]$.

    1. If the replacement is empty, make StartX a final node.

    2. If the replacement has just one element, $R_0$, add an edge from StartX to EndX with edge label $R_0$.

    3. Otherwise:

      1. Add an edge from StartX to a new node labeled $X_{i,0}$ (where $i$ identifies the grammar rule), with edge label $R_0$.

      2. For each remaining element $R_ j$ in the replacement add an edge from $X_{i,j-1}$ to a new node labeled $X_{i,j}$ with edge label $R_ j$. (For example, for element $R_1$, a new node $X_{i,1}$ is added, and an edge from $X_{i,0}$ to $X_{i,1}$ with edge label $R_1$.)

      3. Add an edge from $X_{i,n-1}$ to EndX with edge label $R_ n$.

Following this procedure, we can convert any BNF grammar into an RTN that defines the same language. Hence, we have proved that RTNs are at least as powerful as BNF grammars.

To complete the proof that BNF grammars and RTNs are equally powerful ways of defining languages, we also need to show that a BNF can define every language that can be defined using an RTN. This part of the proof can be done using a similar strategy in reverse: by showing a procedure that can be used to construct a BNF equivalent to any input RTN. We leave the details as an exercise for especially ambitious readers.

Exercise 2.15.  Produce an RTN that defines the same languages as the BNF grammar from Exercise 2.14.

Exercise 2.16.  $\left[\star \right]$Prove that BNF grammars are as powerful as RTNs by devising a procedure that can construct a BNF grammar that defines the same language as any input RTN.