2.3.3 The boxplot

The frequency distribution provides a useful graphical summary of biological data; however, it can be tedious to work with. In particular, it would be nice to have a graphical summary that captures some detail about a distribution but that also facilitates comparisons among several distributions. The box plot (or boxplot) is just the thing.

The boxplot, sometimes referred to as a box and whisker diagram, was first shown in Fig.1.5. To illustrate its utility, consider the frequency distributions of four different variables (y1, y2, y3, and y4). Each of the distributions is plotted in the left hand panels of Fig.2.7, and their analogous boxplots are shown in the right hand panels.

The boxplots contain several important pieces of information. First, the black line in the middle of the “box” designates the median of the distribution, also referred to as the 50th percentile. No more than 50% of the values fall below, and no more than 50% of the values fall above, this value. The limits of the box designate the 25th and 75th percentiles of the distribution. So, no more than 25% of the values fall to the left of the box, and no more than 25% of the values fall to the right of the box. Consequently, roughly 50% of all of the observations fall within the limits defined by the box. The width of the box, that is, the distance between the 25th and 75th percentiles, is referred to as the interquartile range (IQR).
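As a minimal sketch of how these quantities can be computed in R, consider the small vector below. The values in y are made up for illustration and are not the data plotted in Fig.2.7. Note also that quantile() offers several definitions of a percentile; the default is used here, and boxplot() itself may place the box edges slightly differently for some sample sizes.

```r
# Hypothetical example data: 11 made-up measurements (not from the text)
y <- c(2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.3, 5.9, 6.4, 7.7, 12.5)

med <- median(y)                            # 50th percentile (line inside the box)
q   <- quantile(y, probs = c(0.25, 0.75))   # 25th and 75th percentiles (box limits)
iqr <- IQR(y)                               # width of the box: q[2] - q[1]

# Proportion of observations that fall within the box limits;
# with a small sample this is close to, but not exactly, 0.5
inside <- mean(y >= q[1] & y <= q[2])

med
q
iqr
inside
```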

Figure 2.7: Frequency distributions (left hand panels) and boxplots (right hand panels) for four different continuous variables. Here, horizontal boxplots are shown; however, boxplots are often drawn vertically, with the y-axis representing the data values.

The whiskers extend out from the box, and they are meant to capture the range of the data. As you can see for y2 and y3, they extend to the minimum and maximum values. However, this is not true for y1 or y4. Each of these distributions has a long tail that extends far from the median. By default, R limits the length of each whisker to 1.5 times the IQR, measured from the edge of the box; the whisker ends at the most extreme observation that falls within that limit. Values that fall beyond the whiskers are plotted as individual points, and might be considered “outliers”. This definition of an outlier is somewhat arbitrary, and, in fact, there are other criteria for identifying outliers.
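The whisker rule can be sketched by hand and checked against boxplot.stats(), the function that boxplot() uses to compute its summary values. The vector y is again hypothetical, and the names lower_fence and upper_fence are introduced here only for illustration.

```r
# Hypothetical example data: 11 made-up measurements (not from the text)
y <- c(2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.3, 5.9, 6.4, 7.7, 12.5)

stats     <- boxplot.stats(y, coef = 1.5)  # coef = 1.5 is the default multiplier
hinges    <- stats$stats[c(2, 4)]          # lower and upper edges of the box
box_width <- diff(hinges)                  # IQR as used by the boxplot

# Fences: the farthest a whisker is allowed to reach from the box
lower_fence <- hinges[1] - 1.5 * box_width
upper_fence <- hinges[2] + 1.5 * box_width

# Whisker ends: the most extreme observations still inside the fences
whiskers <- range(y[y >= lower_fence & y <= upper_fence])

# Observations beyond the fences are plotted as individual points
flagged <- y[y < lower_fence | y > upper_fence]

whiskers
flagged
stats$out   # the same points, as reported by boxplot.stats()
```

Changing the multiplier (the range argument of boxplot(), or coef in boxplot.stats()) changes which points are flagged, which is one way to see that the 1.5 rule is a convention rather than a formal test.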

Outliers can sometimes wreak havoc on statistical tests, and, in some cases, an investigator may want to remove them prior to analysis. An outlier may indeed be the result of some error, either in data collection or in data management; in fact, a primary goal in outlier detection is to identify data that may be erroneous. However, treating outliers as errors can be a risky venture. In the absence of any other evidence, removing data that do not seem to be consistent with the bulk of the observations (i.e., they seem to be outliers) can be very misleading and should only be done with extreme caution.

In R, boxplots are easily created using the function boxplot() (take a look at ?boxplot). In addition, as we saw in Fig.1.5, when a continuous variable is plotted as a function of a factor in plot(), the result is a series of boxplots, one for each level of the factor.
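A short sketch of both approaches follows. The data frame dat, its columns group and y, and the simulated values are assumptions made for illustration; they are not the data behind Fig.1.5 or Fig.2.7.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Hypothetical data: a continuous response measured in three groups;
# group C is drawn from a skewed distribution, so it has a long right tail
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  y     = c(rnorm(30, mean = 10), rnorm(30, mean = 12), rexp(30, rate = 0.1))
)

# One vertical boxplot per level of the factor, using the formula interface
bp <- boxplot(y ~ group, data = dat, xlab = "group", ylab = "y")

# The same boxplots drawn horizontally
boxplot(y ~ group, data = dat, horizontal = TRUE)

# plot() with a factor as the explanatory variable also draws boxplots
plot(y ~ group, data = dat)

# boxplot() invisibly returns the values it plotted, one column per group
bp$stats
```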