2.3.1 Frequency tables and distributions

One of the most important pieces of information that can be gleaned from data is the number of times each value of a variable was observed. Essentially, this is summarized as a list of all possible outcomes combined with the number of times each occurred (i.e., the frequency of occurrence). This information is then displayed in a table or a barplot. You can think of the table, or, more often, the barplot, as a summary of the distribution of occurrences among all possible values. Often, we use $f_{i}$ to represent the frequency of the $i^{th}$ possible value.

Consider a qualitative variable that lists the graduate degree status of 14 students.

     >  d = c(rep("M.S.", 10), rep("Ph.D.", 4))
     >  table(d)

     d
      M.S. Ph.D.
        10     4

     >  barplot(table(d), ylab = "Frequency")

| Degree | $f_{i}$ |
|--------|---------|
| M.S.   | 10      |
| Ph.D.  | 4       |

Figure 2.3: Frequency table and distribution for an example qualitative variable. In this example, the total sample size (i.e., $n$) is 14.

The frequencies are summarized using a barplot in Fig. 2.3; in this frequency distribution, the ordering on the x-axis has no meaning. To create the data, we again used the function c(), but this time we combined two vectors that were each created with a new function, rep() (short for repeat), which told R to repeat "M.S." ten times and "Ph.D." four times. This illustrates that functions can be used as arguments to other functions, a very useful feature. For the barplot, the input was itself a function call, table(d), which created a table of counts from the vector d.
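As a side note, rep() also accepts a vector of values together with a vector of repeat counts, so the same data can be built in a single call. The sketch below shows this equivalent construction; it is not the code used for Fig. 2.3, and the names d2 and counts are introduced here only for illustration.

```r
# Construction used in the text: two rep() calls combined with c()
d = c(rep("M.S.", 10), rep("Ph.D.", 4))

# Equivalent construction (illustrative): one rep() call with a vector of counts
d2 = rep(c("M.S.", "Ph.D."), times = c(10, 4))

identical(d, d2)      # checks that both vectors hold the same values in the same order

# table() returns the counts, which can be stored and reused
counts = table(d)
counts[["M.S."]]      # frequency of "M.S."
sum(counts)           # total sample size n
```

Storing the result of table() in a variable is convenient when the same counts are needed for both a printed table and a barplot.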

As an example of ordinal data, consider the distribution of grades within a class (Fig. 2.4). Here, the possible values of the variable of interest follow a specific order. As a result, the x-axis follows this ordering.

     >  d = c(rep("A", 10), rep("B", 20), rep("C", 30),
    + rep("D", 5), rep("F", 2))
     >  barplot(table(d), ylab = "Frequency")

| Grade | $f_{i}$ |
|-------|---------|
| A     | 10      |
| B     | 20      |
| C     | 30      |
| D     | 5       |
| F     | 2       |

Figure 2.4: Frequency table and distribution for an ordinal variable ($n = 67$).

For discrete data (Fig. 2.5), the possible values can be lumped together into ranges. Here, the last range includes all values greater than 15.

     >  d = c(rep("0-3", 6), rep("4-7", 17), rep("8-11",
    + 30), rep("12-15", 7), rep(" > 15", 1))
     >  barplot(table(d), ylab = "Frequency")

| Number per plot | $f_{i}$ |
|-----------------|---------|
| 0-3             | 6       |
| 4-7             | 17      |
| 8-11            | 30      |
| 12-15           | 7       |
| > 15            | 1       |

Figure 2.5: Frequency table and distribution for a discrete variable ($n = 61$).
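One practical point about range labels: table() orders the categories of a character vector as text, so a label such as "12-15" is placed before "4-7" and the bars will not follow the numerical order of the ranges. A minimal sketch of one way to control the order is to store the data as a factor with explicitly ordered levels. The names bins and freq are introduced only for this sketch, and the counts are those from the code above.

```r
# Bin labels in their intended order, with the counts from the code above
# (bins and freq are hypothetical names used only in this sketch)
bins = c("0-3", "4-7", "8-11", "12-15", "> 15")
freq = c(6, 17, 30, 7, 1)

# factor() with explicit levels fixes the category order
d = factor(rep(bins, times = freq), levels = bins)

table(d)                                 # counts listed in the order given by bins
barplot(table(d), ylab = "Frequency")    # bars drawn in the same order
```

The grades in Fig. 2.4 did not need this step only because their alphabetical order happens to match their natural order.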

Finally, for a continuous variable (Fig. 2.6), the possible values are again put into ranges. This is necessary because, in theory, there is an infinite number of possible values. For continuous data, these ranges are sometimes referred to as bins, and the frequency distribution is referred to as a histogram.

     >  set.seed(1)
     >  Y = rnorm(100) + 9
     >  hist(Y, main = "", xlab = "")

To create the continuous data for Fig. 2.6, we used a random number generator. rnorm() created 100 random values from a standard normal distribution (mean 0, standard deviation 1), and we added 9 to each of these. set.seed() set the seed of the random number generator so that the same random values can be recreated in the future if necessary. Here, instead of using barplot(), we used hist() to get the frequency distribution.

| Y                      | $f_{i}$ |
|------------------------|---------|
| $Y < 7.00$             | 1       |
| $7.00 \leq Y < 7.50$   | 3       |
| $7.50 \leq Y < 8.00$   | 7       |
| $8.00 \leq Y < 8.50$   | 14      |
| $8.50 \leq Y < 9.00$   | 21      |
| $9.00 \leq Y < 9.50$   | 20      |
| $9.50 \leq Y < 10.00$  | 19      |
| $10.00 \leq Y < 10.50$ | 9       |
| $10.50 \leq Y < 11.00$ | 4       |
| $11.00 \leq Y$         | 2       |

Figure 2.6: Frequency table and distribution for a continuous variable called Y ($n = 100$). In this case, the frequency distribution is also called a histogram.
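The bin boundaries and counts behind a histogram can also be inspected directly, because hist() returns them as part of its result. The following is a minimal sketch, assuming the same seed and data as above; the names h and Y2 are introduced only for illustration. Note that hist() picks the bin boundaries itself and, by default, treats each bin as closed on the right.

```r
set.seed(1)                 # same seed as above, so the same 100 values are drawn
Y = rnorm(100) + 9

h = hist(Y, plot = FALSE)   # compute the histogram without drawing it
h$breaks                    # bin boundaries chosen by hist()
h$counts                    # frequency f_i in each bin
sum(h$counts)               # equals n, the number of observations

# Resetting the seed reproduces exactly the same values
set.seed(1)
Y2 = rnorm(100) + 9
identical(Y, Y2)
```

The final comparison illustrates why set.seed() is useful: with the same seed, the "random" values are identical from one run to the next.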

Before going on, it should be noted that, in many cases, we may choose to use relative frequencies rather than raw frequencies when describing data. The relative frequencies are easily calculated by dividing each $f_{i}$ by $n$, the total number of observations. The importance of using relative frequencies will be discussed further in Chapter 3.
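As a brief illustration, relative frequencies for the grade data of Fig. 2.4 could be computed as sketched below. The names f, n, rel, and rel2 are introduced only for this example; prop.table() is a built-in function that performs the same division for a one-way table.

```r
d = c(rep("A", 10), rep("B", 20), rep("C", 30), rep("D", 5), rep("F", 2))

f = table(d)            # raw frequencies f_i
n = length(d)           # total number of observations

rel = f / n             # relative frequencies f_i / n
rel2 = prop.table(f)    # built-in equivalent for a one-way table

sum(rel)                # relative frequencies sum to 1
barplot(rel, ylab = "Relative frequency")
```

The resulting barplot has the same shape as Fig. 2.4; only the scale of the y-axis changes.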