# 2.3.1 Frequency tables and distributions

One of the most important pieces of information that can be gleaned from data is the number of times each value of a variable was observed. Essentially, this is summarized as a list of all possible outcomes combined with the number of times each occurred (i.e., the frequency of occurrence). This information is then displayed in a table or a barplot. You can think of the table, or, more often, the barplot, as a summary of the distribution of occurrences among all possible values. Often, we use $f_{i}$ to represent the frequency of the $i^{th}$ possible value.

Consider a qualitative variable that lists the graduate degree status of 14 students.

> d = c(rep("M.S.", 10), rep("Ph.D.", 4))

> table(d)

d

M.S. Ph.D.

10 4

> barplot(table(d), ylab = "Frequency")

Degree | $f_{i}$ |
---|---|
M.S. | 10 |
Ph.D. | 4 |

The frequencies are summarized using a barplot in Fig. 2.3; in this
frequency distribution, the ordering on the x-axis has no meaning. To
create the data, the function `c()` was again used, but this time we
combined two vectors that were created with a new function. `rep()`
(which stands for repeat) told **R** to repeat "M.S." ten times and
"Ph.D." four times. Thus, we were able to create the data vector `d` by
combining two character vectors that were each created using `rep()`.
This illustrates the fact that you can use functions as arguments to
other functions, a very useful feature. For the barplot, we actually
used as input the output of the function `table()`, which created a
table of counts based on the vector `d`.
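To make the nesting of functions concrete, the short sketch below builds the same vector twice: once step by step, and once with the `rep()` calls placed directly inside `c()`. The intermediate names `ms`, `phd`, `d_steps`, and `d_nested` are ours and are used only for illustration.

```r
# Step by step: create each character vector, then combine them
ms  <- rep("M.S.", 10)    # "M.S." repeated ten times
phd <- rep("Ph.D.", 4)    # "Ph.D." repeated four times
d_steps <- c(ms, phd)

# Nested: the rep() calls are used directly as arguments to c()
d_nested <- c(rep("M.S.", 10), rep("Ph.D.", 4))

# Both approaches give exactly the same vector of 14 values
same <- identical(d_steps, d_nested)
print(same)

# table() counts how often each distinct value occurs
counts <- table(d_nested)
print(counts)
```

The nested form saves typing, while the step-by-step form can be easier to check when something goes wrong.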

As an example of ordinal data, consider the distribution of grades within a class (Fig. 2.4). Here, the possible values of the variable of interest follow a specific order. As a result, the x-axis follows this ordering.

> d = c(rep("A", 10), rep("B", 20), rep("C", 30),

+ rep("D", 5), rep("F", 2))

> barplot(table(d), ylab = "Frequency")

Grade | $f_{i}$ |
---|---|
A | 10 |
B | 20 |
C | 30 |
D | 5 |
F | 2 |
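For the grades, the alphabetical order that `table()` uses for character data happens to match the natural order of the categories. That is not always the case. The sketch below uses made-up ordinal responses (the vector `x` and its values are hypothetical, not data from this chapter) to show one way to state the order explicitly with `factor()`.

```r
# Hypothetical ordinal responses, made up for illustration only
x <- c(rep("Low", 4), rep("Medium", 6), rep("High", 2))

# With plain character data, table() sorts the categories alphabetically
f_alpha <- table(x)
print(f_alpha)

# factor() with explicit levels records the intended order
x_ord <- factor(x, levels = c("Low", "Medium", "High"), ordered = TRUE)
f_ord <- table(x_ord)
print(f_ord)
```

Passing `f_ord` to `barplot()` would then draw the bars in the order Low, Medium, High.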

For discrete data (Fig. 2.5), the possible values may be lumped together in ranges. The last range includes all values greater than 15.

+ 30), rep("12-15", 7), rep(" > 15", 1))

> barplot(table(d), ylab = "Frequency")

Number per plot | $f_{i}$ |
---|---|
0-3 | 6 |
4-7 | 7 |
8-11 | 17 |
12-15 | 30 |
\> 15 | 1 |
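When the raw counts are available, the ranges do not have to be typed in by hand. As a minimal sketch, assuming a hypothetical vector of counts per plot (`counts` below is made up and is not the data behind Fig. 2.5), `cut()` can assign each value to a range and keep the ranges in their natural order.

```r
# Hypothetical raw counts per plot (made up for illustration only)
counts <- c(0, 2, 3, 5, 6, 9, 10, 11, 13, 14, 15, 18)

# cut() assigns each value to a range. The intervals are (0,3], (3,7],
# (7,11], (11,15], and (15,Inf]; include.lowest = TRUE also puts 0 in
# the first range.
ranges <- cut(counts,
              breaks = c(0, 3, 7, 11, 15, Inf),
              labels = c("0-3", "4-7", "8-11", "12-15", "> 15"),
              include.lowest = TRUE)

# Because cut() returns a factor, table() keeps the ranges in order
freq <- table(ranges)
print(freq)
```

Had the range labels been stored as plain character strings, `table()` would have sorted them alphabetically, which places "12-15" before "4-7".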

Finally, for a continuous variable (Fig. 2.6),
the possible values are again put into ranges. This is necessary
because, in theory, there is an infinite number of possible values. For
continuous data, these ranges are sometimes referred to as *bins*, and
the plot of the frequency distribution is referred to as a *histogram*.

> Y = rnorm(100) + 9

> hist(Y, main = "", xlab = "")

To create the continuous data for Fig. 2.6, we actually used a random
number generator. `rnorm()` created 100 random values from a standard
normal distribution, and we added 9 to each of these. `set.seed()`
simply sets a random number generator seed value so that we can recreate
the same random values in the future if necessary. Here, instead of
using a barplot, we used `hist()` to get the frequency distribution.
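The bins and frequencies behind a histogram can also be inspected directly. The sketch below is one way to do this; the seed value of 1 is an arbitrary choice made for this example, so the resulting counts will generally differ from those shown for Fig. 2.6.

```r
# Setting the same seed before each call reproduces the same values
set.seed(1)             # arbitrary seed, chosen only for this sketch
Y  <- rnorm(100) + 9
set.seed(1)
Y2 <- rnorm(100) + 9
reproducible <- identical(Y, Y2)
print(reproducible)

# plot = FALSE returns the bins and counts without drawing anything;
# right = FALSE makes each bin include its lower limit but not its upper
h <- hist(Y, plot = FALSE, right = FALSE)
print(h$breaks)         # bin boundaries
print(h$counts)         # frequency f_i for each bin
print(sum(h$counts))    # total number of observations
```

The `breaks` argument of `hist()` can be used to set the bin boundaries explicitly when the default choice is not suitable.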

Y | $f_{i}$ |
---|---|
$Y < 7.00$ | 1 |
$7.00 \leq Y < 7.50$ | 3 |
$7.50 \leq Y < 8.00$ | 7 |
$8.00 \leq Y < 8.50$ | 14 |
$8.50 \leq Y < 9.00$ | 21 |
$9.00 \leq Y < 9.50$ | 20 |
$9.50 \leq Y < 10.00$ | 19 |
$10.00 \leq Y < 10.50$ | 9 |
$10.50 \leq Y < 11.00$ | 4 |
$11.00 \leq Y$ | 2 |

Before going on, it should be noted that, in many cases, we may choose to use relative frequencies rather than raw frequencies when describing data. The relative frequencies are easily calculated by dividing each $f_{i}$ by $n$, the total number of observations. The importance of using relative frequencies will be discussed further in Chapter 3.
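As a brief illustration, the relative frequencies for the degree data used earlier (10 M.S. and 4 Ph.D. students) can be computed as sketched below. The names `f`, `n`, and `rel` are ours; `prop.table()` is a base **R** function that performs the same division.

```r
# Degree data: 10 M.S. and 4 Ph.D. students
d <- c(rep("M.S.", 10), rep("Ph.D.", 4))

f <- table(d)        # raw frequencies f_i
n <- length(d)       # total number of observations

# Relative frequency: each f_i divided by n
rel <- f / n
print(rel)

# prop.table() gives the same result in one step
rel2 <- prop.table(f)
print(rel2)
```

The relative frequencies always sum to 1, which makes it easier to compare data sets with different numbers of observations.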