1.2.2 Representing data

We can use sequences of bits to represent many kinds of data. All we need to do is think of the right binary questions for which the bits give answers that allow us to represent each possible value. Next, we provide examples showing how bits can be used to represent numbers, text, and pictures.

Numbers. In the previous section, we identified a number using a tree where each node asks a binary question and the branches correspond to the “Yes” and “No” answers. A more compact way of writing down our decisions following the tree is to use 0 to encode a “No” answer and 1 to encode a “Yes” answer, and to describe a path to a leaf by a sequence of 0s and 1s: the “No”, “No”, “No” path to 0 is encoded as 000, and the “Yes”, “No”, “Yes” path to 5 is encoded as 101. This is known as the binary number system. Whereas the decimal number system uses ten as its base (there are ten decimal digits, and the positional values increase as powers of ten), the binary system uses two as its base (there are two binary digits, and the positional values increase as powers of two).

There are only 10 types of people in the world: those who understand binary, and those who don’t.
Infamous T-Shirt

For example, the binary number 10010110 represents the decimal value 150:

Binary:          1      0      0      1      0      1      1      0
Position value:  $2^7$  $2^6$  $2^5$  $2^4$  $2^3$  $2^2$  $2^1$  $2^0$
Decimal value:   128    64     32     16     8      4      2      1

As in the decimal number system, the value of each binary digit depends on its position.
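To make the positional rule concrete, here is a small Python sketch (the function name is ours, not part of the text) that computes the decimal value of a binary numeral by summing positional values:

```python
def binary_to_decimal(bits):
    """Convert a string of 0s and 1s to its decimal value."""
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)  # each step shifts every prior weight up one position
    return value

print(binary_to_decimal("10010110"))  # 128 + 16 + 4 + 2 = 150
print(binary_to_decimal("101"))       # the "Yes", "No", "Yes" path to 5
```

This is equivalent to multiplying each digit by its power of two and summing, as in the table above.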

By using more bits, we can represent larger numbers. With enough bits, we can represent any natural number this way. The more bits we have, the larger the set of possible numbers we can represent. As we saw with the binary decision trees, $n$ bits can be used to represent $2^{n}$ different numbers.

Discrete Values. We can use a finite sequence of bits to describe any value that is selected from a countable set of possible values. A set is countable if there is a way to assign a unique natural number to each element of the set. All finite sets are countable. Some, but not all, infinite sets are countable. For example, there appear to be more integers than there are natural numbers, since for each natural number, $n$, there are two corresponding integers, $n$ and $-n$. But the integers are in fact countable: we can enumerate them as $0, 1, -1, 2, -2, 3, -3, 4, -4, \ldots $, assigning a unique natural number to each integer in turn.
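This zig-zag enumeration of the integers can be sketched in Python (the function names are ours):

```python
def integer_to_natural(i):
    """Assign a unique natural number to each integer: 0->0, 1->1, -1->2, 2->3, ..."""
    return 2 * i - 1 if i > 0 else -2 * i

def natural_to_integer(n):
    """Invert the mapping, recovering the enumeration 0, 1, -1, 2, -2, ..."""
    return (n + 1) // 2 if n % 2 == 1 else -(n // 2)

print([natural_to_integer(n) for n in range(9)])  # [0, 1, -1, 2, -2, 3, -3, 4, -4]
```

Since the mapping is invertible, every integer receives exactly one natural number, which is what countability requires.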

Other sets, such as the real numbers, are uncountable. Georg Cantor proved this using a technique known as diagonalization. Suppose the real numbers are enumerable. This means we could list all the real numbers in some order, assigning a unique natural number to each. For example, considering just the real numbers between 0 and 1, our enumeration might be:

1 $.00000000000000\ldots $
2 $.25000000000000\ldots $
3 $.33333333333333\ldots $
4 $.6666666666666\ldots $
$\cdots $ $\cdots $
57236 $.141592653589793\ldots $
$\cdots $ $\cdots $

Cantor proved by contradiction that there is no way to enumerate all the real numbers. The trick is to produce a new real number that is not part of the enumeration. We can do this by constructing a number whose first digit is different from the first digit of the first number, whose second digit is different from the second digit of the second number, etc. For the example enumeration above, we might choose $.1468\ldots $.

The $k^{th}$ digit of the constructed number is different from the $k^{th}$ digit of number $k$ in the enumeration. Since the constructed number differs in at least one digit from every enumerated number, it does not match any of the enumerated numbers exactly. Thus, there is a real number that is not included in the enumeration, and it is impossible to enumerate all the real numbers.1
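The diagonal construction can be sketched in Python on a finite prefix of a claimed enumeration (each entry is the string of fractional digits; the digit choice avoids 0 and 9, sidestepping the technical problem noted in the footnote):

```python
def diagonal(enumeration):
    """Return a number (as a digit string) differing from entry k in its k-th digit."""
    digits = []
    for k, number in enumerate(enumeration):
        d = int(number[k])
        # Choose a digit other than d; using only 4 and 5 avoids runs of 0s or 9s,
        # so the constructed number has a unique decimal representation.
        digits.append("4" if d != 4 else "5")
    return "." + "".join(digits)

sample = ["00000000", "25000000", "33333333", "66666666"]
print(diagonal(sample))  # .4444 -- differs from entry k in its k-th digit
```

Any number the enumeration claims to contain already occupies some position $k$, and the constructed number differs from it in digit $k$.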

Digital computers2 operate on inputs that are discrete values. Continuous values, such as real numbers, can only be approximated by computers. Next, we consider how two types of data, text and images, can be represented by computers. The first type, text, is discrete and can be represented exactly; images are continuous, and can only be represented approximately.

Text. The set of all possible sequences of characters is countable. One way to see this is to observe that we could give each possible text fragment a unique number, and then use that number to identify the item. For example, we could enumerate all texts alphabetically by length (here, we limit the characters to lowercase letters): a, b, c, $\ldots $, z, aa, ab, $\ldots $, az, ba, $\ldots $, zz, aaa, $\ldots $

We have seen that we can represent all the natural numbers with a sequence of bits, so once we have a mapping between each item in the set and a unique natural number, we can represent all of the items in the set. For the representation to be useful, though, we usually need a way to construct the corresponding number for any item directly.

So, instead of enumerating a mapping between all possible character sequences and the natural numbers, we need a process for converting any text to a unique number that represents that text. Suppose we limit our text to characters in the standard English alphabet. If we include lower-case letters (26), upper-case letters (26), and punctuation (space, comma, period, newline, semi-colon), we have 57 different symbols to represent. We can assign a unique number to each symbol, and encode the corresponding number with six bits (this leaves seven values unused since six bits can distinguish 64 values). For example, we could encode using the mapping shown in Table 1.1. The first bit answers the question: “Is it an uppercase letter after F or a special character?”. When the first bit is 0, the second bit answers the question: “Is it after p?”.

a 000000
b 000001
c 000010
d 000011
$\cdots $ $\cdots $
p 001111
q 010000
$\cdots $ $\cdots $
z 011001

A 011010
B 011011
C 011100
$\cdots $ $\cdots $
F 011111
G 100000
$\cdots $ $\cdots $
Y 110010
Z 110011
space 110100
, 110101
. 110110
newline 110111
; 111000
unused 111001
$\cdots $ $\cdots $
unused 111110
unused 111111

Table 1.1: Encoding characters using bits.

This is one way to encode the alphabet, but not the one typically used by computers. One commonly used encoding known as ASCII (the American Standard Code for Information Interchange) uses seven bits so that 128 different symbols can be encoded. The extra symbols are used to encode more special characters.

Once we have a way of mapping each individual letter to a fixed-length bit sequence, we can write down any sequence of letters by just concatenating the bits encoding each letter. So, the text CS is encoded as 011100 101100. Using this encoding, a text of length $n$ written in the 57-symbol alphabet requires $6n$ bits. To convert the bit sequence back into text, just invert the mapping by replacing each group of six bits with the corresponding letter.
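The encoding of Table 1.1 can be sketched in Python (the symbol ordering follows the table; the helper names are ours):

```python
import string

# The 57 symbols of Table 1.1, in encoding order: a..z, A..Z, then punctuation.
SYMBOLS = (list(string.ascii_lowercase) + list(string.ascii_uppercase)
           + [" ", ",", ".", "\n", ";"])

def encode(text):
    """Encode text as a sequence of six-bit groups."""
    return " ".join(format(SYMBOLS.index(c), "06b") for c in text)

def decode(bits):
    """Invert the mapping by replacing each six-bit group with its symbol."""
    return "".join(SYMBOLS[int(group, 2)] for group in bits.split())

print(encode("CS"))             # 011100 101100
print(decode("011100 101100"))  # CS
```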

Rich Data. We can also use bit sequences to represent complex data like pictures, movies, and audio recordings. First, consider a simple black and white picture:

Since the picture is divided into discrete squares known as pixels, we could encode this as a sequence of bits by using one bit to encode the color of each pixel (for example, using 1 to represent black, and 0 to represent white). This image is 16x16, so has 256 pixels total. We could represent the image using a sequence of 256 bits (starting from the top left corner):

$\cdots $
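The pixel encoding can be sketched in Python on a tiny 4x4 image rather than the full 16x16 one (the image data here is our own example):

```python
# 1 represents a black pixel, 0 a white pixel.
image = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

# Scan row by row from the top left, emitting one bit per pixel.
bits = "".join(str(pixel) for row in image for pixel in row)
print(bits)       # 0110100110010110
print(len(bits))  # 16 bits for 4x4; the 16x16 image needs 256
```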

What about complex pictures that are not divided into discrete squares or a fixed number of colors, like Van Gogh’s Starry Night?

Different wavelengths of electromagnetic radiation have different colors. For example, light with wavelengths between 625 and 730 nanometers appears red. But, each wavelength of light has a slightly different color; for example, light with wavelength 650 nanometers would be a different color (albeit imperceptible to humans) from light of wavelength $650.0000001$ nanometers. There are arguably infinitely many different colors, corresponding to different wavelengths of visible light.3 Since the colors are continuous and not discrete, there is no way to map each color to a unique, finite bit sequence.

On the other hand, the human eye and brain have limits. We cannot actually perceive infinitely many different colors; at some point the wavelengths are close enough that we cannot distinguish them. The ability to distinguish colors varies, but most humans can perceive only a few million different colors. The set of colors that can be distinguished by a typical human is finite; any finite set is countable, so we can map each distinguishable color to a unique bit sequence.

A common way to represent color is to break it into its three primary components (red, green, and blue), and record the intensity of each component. The more bits available to represent a color, the more different colors that can be represented.
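A common concrete choice, sketched here in Python, uses eight bits per component (the function name is ours):

```python
def color_to_bits(red, green, blue):
    """Pack three 8-bit intensities (0-255 each) into a 24-bit sequence."""
    return format(red, "08b") + format(green, "08b") + format(blue, "08b")

print(color_to_bits(255, 0, 0))  # pure red: 111111110000000000000000
print(2 ** 24)                   # 16777216 distinct colors with 24 bits
```

With 24 bits, the number of representable colors (about 16.8 million) exceeds the few million a typical human can distinguish.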

Thus, we can represent a picture by recording the approximate color at each point. If space in the universe is continuous, there are infinitely many points. But, as with color, once the points get smaller than a certain size they are imperceptible. We can approximate the picture by dividing the canvas into small regions and sampling the average color of each region. The smaller the sample regions, the more bits we will have and the more detail that will be visible in the image. With enough bits to represent color, and enough sample points, we can represent any image as a sequence of bits.

Summary. We can use sequences of bits to represent any natural number exactly, and hence, represent any member of a countable set using a sequence of bits. The more bits we use the more different values that can be represented; with $n$ bits we can represent $2^{n}$ different values.

We can also use sequences of bits to represent rich data like images, audio, and video. Since the world we are trying to represent is continuous, there are infinitely many possible values, and we cannot represent these objects exactly with any finite sequence of bits. However, since human perception is limited, with enough bits we can represent any of these adequately well. Finding ways to represent data that are both efficient and easy to manipulate and interpret is a constant challenge in computing. Manipulating sequences of bits is awkward, so we need ways of thinking about bit-level representations of data at higher levels of abstraction. Chapter 5 focuses on ways to manage complex data.

  1. Alert readers should be worried that this isn’t quite correct since the resulting number may be a different way to represent the same real number (for example, $.1999999999999\ldots = .20000000000\ldots $ even though they differ in each digit). This technical problem can be fixed by placing some restrictions on how the modified digits are chosen to avoid infinite repetitions. 

  2. This is, indeed, part of the definition of a digital computer. An analog computer operates on continuous values. In Chapter 6, we explain more of the inner workings of a computer and why nearly all computers today are digital. We use computer to mean a digital computer in this book. The property that there are more real numbers than natural numbers has important implications for what can and cannot be computed, which we return to in Chapter 12.

  3. Whether there are actually infinitely many different colors comes down to the question of whether the space-time of the universe is continuous or discrete. Certainly in our common perception it seems to be continuous—we can imagine dividing any length into two shorter lengths. In reality, this may not be the case at extremely tiny scales. It is not known if time can continue to be subdivided below $10^{-40}$ of a second.