# 2.4 Replacement grammars

Another way to define a language is to use a grammar. This is the most common way languages are defined by computer scientists today, and the way we will use for the rest of this book.

A *grammar* is a set of rules for generating all strings in the
language. We use the *Backus-Naur Form* (BNF) notation to define a
grammar. BNF grammars are exactly as powerful as
recursive transition networks (Exploration 2.1 explains what
this means and why it is the case), but easier to write down.

BNF was invented by John Backus in the late 1950s. Backus led efforts at IBM to define and implement Fortran, the first widely used programming language. Fortran enabled computer programs to be written in a language more like familiar algebraic formulas than low-level machine instructions, enabling programs to be written more quickly. In defining the Fortran language, Backus and his team used ad hoc English descriptions to define the language.

These ad hoc descriptions were often misinterpreted, motivating the need for a more precise way of defining a language.

Rules in a Backus-Naur Form grammar have the form:

*nonterminal* ::$\Rightarrow$ *replacement*

*I flunked out every year. I never studied. I hated studying. I was just
goofing around. It had the delightful consequence that every year I went
to summer school in New Hampshire where I spent the summer sailing and
having a nice time.* John Backus

The left side of a rule is always a single symbol, known as a
*nonterminal* since it can never appear in the final generated string.
The right side of a rule contains one or more symbols. These symbols may
include nonterminals, which will be replaced using replacement rules
before generating the final string. They may also be *terminals*, which
are output symbols that never appear as the left side of a rule. When we
describe grammars, we use *italics* to represent nonterminal symbols,
and **bold** to represent terminal symbols. The terminals are the
primitives in the language; the grammar rules are its means of
combination.

We can generate a string in the language described by a replacement
grammar by starting from a designated start symbol (e.g., *sentence*),
and at each step selecting a nonterminal in the working string, and
replacing it with the right side of a replacement rule whose left side
matches the nonterminal. Wherever we find a nonterminal on the left side
of a rule, we can replace it with what appears on the right side of any
rule where that nonterminal matches the left side. A string is generated
once there are no nonterminals remaining.

Here is an example BNF grammar (that describes the same language as the RTN in Figure 2.1):

1. | Sentence | :: $\Rightarrow$ | Noun Verb |

2. | Noun | :: $\Rightarrow$ | Alice |

3. | Noun | :: $\Rightarrow$ | Bob |

4. | Verb | :: $\Rightarrow$ | Jumps |

5. | Verb | :: $\Rightarrow$ | runs |

Starting from *Sentence*, the grammar can generate four sentences:
“Alice jumps”, “Alice runs”, “Bob jumps”, and “Bob runs”.

A *derivation* shows how a grammar generates a given string. Here is the
derivation of “Alice runs”:

Sentence | :: $\Rightarrow$ | $\underline{\text{Noun}}$ Verb | using Rule 1 | |

:: $\Rightarrow$ | Alice Verb |
replacing Noun using Rule 2 |
||

:: $\Rightarrow$ | Alice runs |
replacing Verb using Rule 5 |

We can represent a grammar derivation as a tree, where the root of the
tree is the starting nonterminal (*Sentence* in this case), and the
leaves of the tree are the terminals that form the derived sentence.
Such a tree is known as a *parse tree*. Here is the parse tree for the
derivation of “Alice runs”:

BNF grammars can be more compact than just listing strings in the
language since a grammar can have many replacements for each
nonterminal. For example, adding the rule, *Noun* ::$\Rightarrow
$ **Colleen**, to the grammar adds two new strings (“Colleen runs” and
“Colleen jumps”) to the language.

**Recursive Grammars.**recursive grammarThe real power of BNF as a
compact notation for describing languages, though, comes once we start
adding recursive rules to our grammar. A grammar is recursive if the
grammar contains a nonterminal that can produce a production that
contains itself.

Suppose we add the rule,

*Sentence* ::$\Rightarrow $ *Sentence* **and** *Sentence*

to our example grammar. Now, how many sentences can we generate?

Infinitely many! This grammar describes the same language as the RTN in Figure 2.2. It can generate "Alice runs and Bob jumps" and "Alice runs and Bob jumps and Alice runs" and sentences with any number of repetitions of "Alice runs". This is very powerful: by using recursive rules a compact grammar can be used to define a language containing infinitely many strings.

This grammar defines the language of the whole numbers (0, 1,$\ldots $) with leading zeros allowed:

Number | :: $\Rightarrow$ | Digit MoreDigits |

MoreDigits | :: $\Rightarrow$ | |

MoreDigits | :: $\Rightarrow$ | Number |

Digit | :: $\Rightarrow$ | 0 |

Digit | :: $\Rightarrow$ | 1 |

Digit | :: $\Rightarrow$ | 2 |

Digit | :: $\Rightarrow$ | 3 |

Digit | :: $\Rightarrow$ | 4 |

Digit | :: $\Rightarrow$ | 5 |

Digit | :: $\Rightarrow$ | 6 |

Digit | :: $\Rightarrow$ | 7 |

Digit | :: $\Rightarrow$ | 8 |

Digit | :: $\Rightarrow$ | 9 |

Here is the parse tree for a derivation of **37** from *Number*:

**Circular vs. Recursive Definitions.** The second rule means we can
replace *MoreDigits* with nothing. This is sometimes written as
$\epsilon $ to make it clear that the replacement is empty:
*MoreDigits* ::$\Rightarrow $ $\epsilon $.

This is a very important rule in the grammar—without it *no* strings
could be generated; with it *infinitely* many strings can be generated.
The key is that we can only produce a string when all nonterminals in
the string have been replaced with terminals. Without the *MoreDigits*
::$\Rightarrow $ $\epsilon $ rule, the only rule we would have
with *MoreDigits* on the left side is the third rule: *MoreDigits*::$\Rightarrow $ *Number*.

The only rule we have with *Number* on the left side is the first rule,
which replaces *Number* with *Digit* *MoreDigits*. Every time we follow
this rule, we replace *MoreDigits* with *Digit* *MoreDigits*. We can
produce as many *Digit*s as we want, but without the *MoreDigits*
::$\Rightarrow $ $\epsilon $ rule we can never stop.

This is the difference between a *circular* definition, and a
*recursive* definition. Without the stopping rule, *MoreDigits* would be
defined in a circular way. There is no way to start with *MoreDigits*
and generate a production that does not contain *MoreDigits* (or a
nonterminal that eventually must produce *MoreDigits*). With the
*MoreDigits* ::$\Rightarrow $ $\epsilon $ rule, however, we have a
way to produce something terminal from *MoreDigits*. This is known as a
*base case* — a rule that turns an otherwise circular definition into a
meaningful, recursive definition.

**Condensed Notation.** It is common to have many grammar rules with the
same left side nonterminal. For example, the whole numbers grammar has
ten rules with *Digit* on the left side to produce the ten terminal
digits. Each of these is an alternative rule that can be used when the
production string contains the nonterminal *Digit*. A compact notation
for these types of rules is to use the vertical bar ($\mid $) to
separate alternative replacements. For example, we could write the ten
*Digit* rules compactly as:

Digit ::$\Rightarrow $ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

**Exercise 2.11. ** Suppose we replaced the first rule (*Number*
::$\Rightarrow $ *Digit* *MoreDigits*) in the whole numbers grammar
with: *Number* ::$\Rightarrow $ *MoreDigits* *Digit*.

How does this change the parse tree for the derivation of

**37**? Draw the parse tree that results from the new grammar.Does this change the language? Either show some string that is in the language defined by the modified grammar but not in the original language (or vice versa), or argue that both grammars generate the same strings.

**Exercise 2.12. ** The grammar for whole numbers we defined allows
strings with non-standard leading zeros such as “000” and “00005”.
Devise a grammar that produces all whole numbers (including “0”), but no
strings with unnecessary leading zeros.

**Exercise 2.13. ** Define a BNF grammar that describes the language of
decimal numbers (the language should include 3.14159, 0.423, and 1120
but not 1.2.3).

**Exercise 2.14. ** The BNF grammar below (extracted from Paul
Mockapetris, *Domain Names - Implementation and Specification*, IETF RFC
1035) describes the language of domain names on the Internet.

Domain | :: $\Rightarrow$ | SubDomainList |

SubDomainList | :: $\Rightarrow$ | Label $\mid $ SubDomainList $\mid $ . Label |

Label | :: $\Rightarrow$ | Letter MoreLetters |

MoreLetters | :: $\Rightarrow$ | LetterHyphens LetterDigit $\mid $ $\epsilon$ |

LetterHyphens | :: $\Rightarrow$ | LDHyphen $\mid $ LDHyphen LetterHyphens $\mid $ $\epsilon $ |

LDHyphen | :: $\Rightarrow$ | LetterDigit LetterDigit $\mid $ - |

LetterDigit | :: $\Rightarrow$ | Letter Digit |

Letter | :: $\Rightarrow$ | A $\mid$ B $\mid$ ... $\mid$ Z $\mid$ a $\mid$ b $\mid$ ... $\mid$ z |

Digit | :: $\Rightarrow$ | 0 $\mid$ 1 $\mid$ 2 $\mid$ 3 $\mid$ 4 $\mid$ 5 $\mid$ 6 $\mid$ 7 $\mid$ 8 $\mid$ 9 |

Show a derivation for

**www.virginia.edu**in the grammar.According to the grammar, which of the following are valid domain names: (1)

**tj**, (2)**a.-b.c**, (3)**a-a.b-b.c-c**, (4)**a.g.r.e.a.t.d.o.m.a.i.n-**.

Section 2.4 claimed that recursive transition networks and BNF grammars are equally powerful. What does it mean to say two systems are equally powerful?

A language description mechanism is used to define a set of strings comprising a language. Hence, the power of a language description mechanism is determined by the set of languages it can define.

One approach to measure the power of language description mechanism would be to count the number of languages that it can define. Even the simplest mechanisms can define infinitely many languages, however, so just counting the number of languages does not distinguish well between the different language description mechanisms. Both RTNs and BNFs can describe infinitely many different languages. We can always add a new edge to an RTN to increase the number of strings in the language, or add a new replacement rule to a BNF that replaces a nonterminal with a new terminal symbol.

Instead, we need to consider the set of languages that each mechanism
can define. A system $A$ is more powerful that another system $B$ if
we can use $A$ to define every language that can be defined by $B$,
and there is some language $L$ that can be defined using $A$ that
cannot be defined using $B$. This matches our intuitive interpretation
of *more powerful* — $A$ is more powerful than $B$ if it can do
everything $B$ can do and more.

The diagrams in Figure 2.6 show three possible scenarios. In the leftmost picture, the set of languages that can be defined by $B$ is a proper subset of the set of languages that can be defined by $A$. Hence, $A$ is more powerful than $B$. In the center picture, the sets are equal. This means every language that can be defined by $A$ can also be defined by $B$, and every language that can be defined by $B$ can also be defined by $A$, and the systems are equally powerful. In the rightmost picture, there are some elements of $A$ that are not elements of $B$, but there are also some elements of $B$ that are not elements of $A$. This means we cannot say either one is more powerful; $A$ can do some things $B$ cannot do, and $B$ can do some things $A$ cannot do.

To determine the relationship between RTNs and BNFs we need to understand if there are languages that can be defined by a BNF that cannot be defined by a RTN and if there are languages that can be defined by a RTN that cannot be defined by an BNF. We will show only the first part of the proof here, and leave the second part as an exercise.

For the first part, we prove that there are no languages that can be
defined by a BNF that cannot be defined by an RTN. Equivalently, *every*
language that can be defined by a BNF grammar has a corresponding RTN.
Since there are infinitely many languages that can be defined by BNF
grammars, we cannot prove this by enumerating each language and showing
its corresponding RTN. Instead, we use a proof technique commonly used
in computer science: *proof by construction*. We show an algorithm that
given any BNF grammar constructs an RTN that defines the same language
as the input BNF grammar.

Our strategy is to construct a subnetwork corresponding to each nonterminal. For each rule where the nonterminal is on the left side, the right hand side is converted to a path through that node’s subnetwork.

Before presenting the general construction algorithm, we illustrate the approach with the example BNF grammar from Example 2.1:

Number | :: $\Rightarrow$ | Digit MoreDigits |

MoreDigits | :: $\Rightarrow$ | $\epsilon $ |

MoreDigits | :: $\Rightarrow$ | Number |

Digit | :: $\Rightarrow$ | 0 $\mid$ 1 $\mid$ 2 $\mid$ 3 $\mid$ 4 $\mid$ 5 $\mid$ 6 $\mid$ 7 $\mid$ 8 $\mid$ 9 |

The grammar has three nonterminals: *Number*, *Digit*, and *MoreDigits*.
For each nonterminal, we construct a subnetwork by first creating two
nodes corresponding to the start and end of the subnetwork for the
nonterminal. We make *StartNumber* the start node for the RTN since
*Number* is the starting nonterminal for the grammar.

Next, we need to add edges to the RTN corresponding to the production
rules in the grammar. The first rule indicates that *Number* can be
replaced by *Digit* *MoreDigits*. To make the corresponding RTN, we need
to introduce an intermediate node since each RTN edge can only contain
one label. We need to traverse two edges, with labels *StartDigit* and
*StartMoreDigits* between the *StartNumber* and *EndNumber* nodes. The
resulting partial RTN is shown in Figure 2.7.

For the *MoreDigits* nonterminal there are two productions. The first
means *MoreDigits* can be replaced with nothing. In an RTN, we cannot
have edges with unlabeled outputs. So, the equivalent of outputting
nothing is to turn *StartMoreDigits* into a final node. The second
production replaces *MoreDigits* with *Number*. We do this in the RTN by
adding an edge between *StartMoreDigits* and *EndMoreDigits* labeled
with *Number*, as shown in Figure 2.8.

Finally, we convert the ten *Digit* productions. For each rule, we add
an edge between *StartDigit* and *EndDigit* labeled with the digit
terminal, as shown in Figure 2.9.

This example illustrates that it is possible to convert a particular grammar to an RTN. For a general proof, we present a general an algorithm that can be used to do the same conversion for any BNF:

For each nonterminal

*X*in the grammar, construct two nodes,*StartX*and*EndX*, where*EndX*is a final node. Make the node*StartS*the start node of the RTN, where*S*is the start nonterminal of the grammar.For each rule in the grammar, add a corresponding path through the RTN. All BNF rules have the form

**X**::$\Rightarrow$*replacement*where*X*is a nonterminal in the grammar and*replacement*is a sequence of zero or more terminals and nonterminals: $\left[ R_0, R_1, \ldots , R_ n \right]$.If the replacement is empty, make

*StartX*a final node.If the replacement has just one element, $R_0$, add an edge from

*StartX*to*EndX*with edge label $R_0$.Otherwise:

Add an edge from

*StartX*to a new node labeled $X_{i,0}$ (where $i$ identifies the grammar rule), with edge label $R_0$.For each remaining element $R_ j$ in the replacement add an edge from $X_{i,j-1}$ to a new node labeled $X_{i,j}$ with edge label $R_ j$. (For example, for element $R_1$, a new node $X_{i,1}$ is added, and an edge from $X_{i,0}$ to $X_{i,1}$ with edge label $R_1$.)

Add an edge from $X_{i,n-1}$ to

*EndX*with edge label $R_ n$.

Following this procedure, we can convert any BNF grammar into an RTN that defines the same language. Hence, we have proved that RTNs are at least as powerful as BNF grammars.

To complete the proof that BNF grammars and RTNs are equally powerful ways of defining languages, we also need to show that a BNF can define every language that can be defined using an RTN. This part of the proof can be done using a similar strategy in reverse: by showing a procedure that can be used to construct a BNF equivalent to any input RTN. We leave the details as an exercise for especially ambitious readers.

**Exercise 2.15. ** Produce an RTN that defines the same languages as
the BNF grammar from Exercise 2.14.

**Exercise 2.16. ** $\left[\star \right]$Prove that BNF grammars
are as powerful as RTNs by devising a procedure that can construct a BNF
grammar that defines the same language as any input RTN.