2.2 Numerical data summaries

Numerical summaries of data sets are widely used to capture some essential features of the data with a few numbers. A summary number calculated from the data is called a statistic.

Univariate statistics

For a single data set, the most widely used statistics are the average and median.

Suppose $N$ denotes the total number of observations and $x_i$ denotes the $i$th observation. Then the average can be written as follows (see note 1):

\[ \bar{x} = \frac{1}{N}\sum_{i=1}^N x_{i} = (x_{1} + x_{2} + x_3 + \cdots + x_{N})/N . \]

The average is also called the sample mean.

By way of illustration, consider the carbon footprints of the 20 vehicles listed in Section 1.4. The data listed in order are

4.0 4.4 5.9 5.9 6.1 6.1 6.1 6.3 6.3 6.3
6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.8 6.8 6.8

In this example, $N=20$ and $x_{i}$ denotes the carbon footprint of vehicle $i$. Then the average carbon footprint is

\begin{align} \bar{x} & = \frac{1}{20}\sum_{i=1}^{20} x_{i} \\ &= (x_{1} + x_{2} + x_3 + \dots + x_{20})/20 \\ &= (4.0 + 4.4 + 5.9 + \dots + 6.8 + 6.8 + 6.8)/20 \\ &= 124/20 = 6.2 \text{ tons CO}_{2}. \end{align}

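The median, on the other hand, is the middle observation when the data are placed in order. With an even number of observations there is no single middle value, so the median is the average of the two middle observations. In this case, there are 20 observations, so the median is the average of the 10th and 11th ordered observations. That is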

\[ \text{median} = (6.3+6.6)/2 = 6.45. \]
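
As a check on the arithmetic, the average and median can be reproduced in base R. This is only a sketch: the vector name `carbon` is chosen here for illustration, and the values are typed in from the listing rather than taken from a data set.

```r
# Carbon footprints (tons CO2) of the 20 vehicles, typed in from the listing
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

N <- length(carbon)                 # number of observations
xbar <- sum(carbon) / N             # average, straight from the formula

sorted <- sort(carbon)
# N is even here, so average the two middle ordered observations
med <- (sorted[N / 2] + sorted[N / 2 + 1]) / 2

# The built-in functions should agree with the hand calculations
print(c(by_formula = xbar, builtin = mean(carbon)))
print(c(by_formula = med, builtin = median(carbon)))
```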

Percentiles are useful for describing the distribution of data. For example, 90% of the data are no larger than the 90th percentile. In the carbon footprint example, the 90th percentile is 6.8 because the 18th of the 20 ordered observations is 6.8, so at least 90% of the data are less than or equal to 6.8. Similarly, the 75th percentile is 6.6 and the 25th percentile is 6.1. The median is the 50th percentile.

A useful measure of how spread out the data are is the interquartile range or IQR. This is simply the difference between the 75th and 25th percentiles, so it is the width of the interval containing the middle 50% of the data. For the example,

\[ \text{IQR} = (6.6 - 6.1) = 0.5. \]
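
The percentiles and IQR can be sketched in base R as follows. The vector name `carbon` is an illustrative choice. Note that `quantile()` interpolates between ordered observations (its default is `type = 7`), and other percentile definitions can give slightly different values on small data sets, although for these data the defaults match the figures quoted above.

```r
# Carbon footprints (tons CO2) of the 20 vehicles, typed in from the listing
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

# 25th, 50th, 75th and 90th percentiles using R's default interpolation rule
pct <- quantile(carbon, probs = c(0.25, 0.50, 0.75, 0.90))
print(pct)

# Interquartile range: difference between the 75th and 25th percentiles
iqr <- unname(pct["75%"] - pct["25%"])
print(iqr)

# IQR() computes the same quantity directly
print(IQR(carbon))
```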

An alternative and more common measure of spread is the standard deviation. This is given by the formula

\[ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_{i} - \bar{x})^2}. \]

In the example, the standard deviation is

\[ s = \sqrt{\frac{1}{19} \left[ (4.0-6.2)^2 + (4.4 - 6.2)^2 + \cdots + (6.8-6.2)^2\right]} = 0.74. \]
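
The standard deviation formula can be checked in the same way. The sketch below, again using an illustrative vector named `carbon`, computes $s$ term by term and compares it with R's built-in `sd()`, which also uses the $N-1$ divisor.

```r
# Carbon footprints (tons CO2) of the 20 vehicles, typed in from the listing
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

N <- length(carbon)
xbar <- mean(carbon)

# Sum of squared deviations from the average, divided by N - 1, then square root
s <- sqrt(sum((carbon - xbar)^2) / (N - 1))

print(round(s, 2))            # hand calculation
print(round(sd(carbon), 2))   # built-in equivalent
```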

R code
# Keep only the vehicles with engines smaller than 2 litres
fuel2 <- fuel[fuel$Litres<2,]

Bivariate statistics

The most commonly used bivariate statistic is the correlation coefficient. It measures the strength of the relationship between two variables and can be written as

\[ r = \frac{\sum (x_{i} - \bar{x})(y_{i}-\bar{y})}{\sqrt{\sum(x_{i}-\bar{x})^2}\sqrt{\sum(y_{i}-\bar{y})^2}}, \]

where the first variable is denoted by $x$ and the second variable by $y$. The correlation coefficient only measures the strength of the linear relationship; it is possible for two variables to have a strong non-linear relationship but a low correlation coefficient. The value of $r$ always lies between -1 and 1, with negative values indicating a negative relationship and positive values indicating a positive relationship.
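
To make the point about linear versus non-linear relationships concrete, the sketch below computes $r$ directly from the formula for two made-up data sets. The function name `correlation` and the variables `x`, `y_linear` and `y_curved` are hypothetical and chosen only for this illustration.

```r
# Correlation coefficient computed term by term from the formula
correlation <- function(x, y) {
  dx <- x - mean(x)   # deviations of x from its average
  dy <- y - mean(y)   # deviations of y from its average
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

x <- seq(-3, 3, by = 0.5)
y_linear <- 2 * x + 1   # exact positive linear relationship
y_curved <- x^2         # exact relationship, but not linear

r_linear <- correlation(x, y_linear)
r_curved <- correlation(x, y_curved)

print(r_linear)             # close to 1
print(r_curved)             # close to 0 despite the perfect dependence on x
print(cor(x, y_curved))     # the built-in cor() should agree
```

The second data set is completely determined by $x$, yet its correlation is essentially zero because the relationship is symmetric about $x=0$ rather than linear.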

For example, the correlation between the carbon footprint and city mpg variables shown in Figure 2.5 is -0.97. The value is negative because the carbon footprint decreases as the city mpg increases. While a value of -0.97 indicates a very strong negative linear relationship, the relationship is even stronger than that number suggests because it is nonlinear.

Figure 2.7: Examples of data sets with different levels of correlation.

The graphs in Figure 2.7 show examples of data sets with varying levels of correlation. Those in Figure 2.8 all have correlation coefficients of 0.82, but they have very different shaped relationships. This shows how important it is not to rely only on correlation coefficients but also to look at the plots of the data.

Figure 2.8: Each of these plots has a correlation coefficient of 0.82. Data from Anscombe F. J. (1973) Graphs in statistical analysis. American Statistician, 27, 17–21.
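
The four Anscombe data sets ship with base R as the data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`). Assuming these are the same data as those plotted in Figure 2.8, the following sketch computes the correlation coefficient for each pair.

```r
# anscombe is a data frame in R's built-in datasets package
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(r) <- paste("data set", 1:4)

# The four coefficients agree to two decimal places
print(round(r, 2))
```

Plotting each pair, for example with `plot(anscombe$x2, anscombe$y2)`, shows how differently shaped the four relationships are.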


Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series. There are several autocorrelation coefficients, depending on the lag length. For example, $r_{1}$ measures the relationship between $y_{t}$ and $y_{t-1}$, $r_{2}$ measures the relationship between $y_{t}$ and $y_{t-2}$, and so on.

Figure 2.9 displays scatterplots of the beer production time series where the horizontal axis shows lagged values of the time series. Each graph shows $y_{t}$ plotted against $y_{t-k}$ for different values of $k$. The autocorrelations are the correlations associated with these scatterplots.

Figure 2.9: Lagged scatterplots for quarterly beer production.

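R code
library(fpp)  # assumed source of the ausbeer data
beer2 <- window(ausbeer, start=1992, end=2006-.1)  # 1992 Q1 to 2005 Q4
lag.plot(beer2, lags=9, do.lines=FALSE)  # y_t against y_{t-k}, k = 1..9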

The value of $r_{k}$ can be written as

\[ r_{k} = \frac{\sum\limits_{t=k+1}^T (y_{t}-\bar{y})(y_{t-k}-\bar{y})}{\sum\limits_{t=1}^T (y_{t}-\bar{y})^2}, \]

where $T$ is the length of the time series.
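
The formula for $r_{k}$ translates almost directly into R. The sketch below uses a hypothetical helper `autocorr` and a simulated series with a repeating four-period pattern (not the beer data), and compares the result with the built-in `acf()` function, which uses the same estimator by default.

```r
# Lag-k autocorrelation computed directly from the formula
autocorr <- function(y, k) {
  n <- length(y)                 # length of the series (T in the formula)
  ybar <- mean(y)
  num <- sum((y[(k + 1):n] - ybar) * (y[1:(n - k)] - ybar))
  den <- sum((y - ybar)^2)
  num / den
}

# Simulated series: a repeating four-period pattern plus random noise
set.seed(1)
y <- rep(c(10, 2, 4, 6), times = 15) + rnorm(60)

r_manual <- sapply(1:9, function(k) autocorr(y, k))

# acf() returns lag 0 first, so drop it before comparing
r_builtin <- as.numeric(acf(y, lag.max = 9, plot = FALSE)$acf[-1])

print(round(r_manual, 3))
print(round(r_builtin, 3))
```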

The first nine autocorrelation coefficients for the beer production data are given in the following table.

$r_{1}$ $r_{2}$ $r_3$ $r_4$ $r_5$ $r_6$ $r_7$ $r_8$ $r_9$
-0.126 -0.650 -0.094 0.863 -0.099 -0.642 -0.098 0.834 -0.116

These correspond to the nine scatterplots in Figure 2.9. The autocorrelation coefficients are normally plotted to form the autocorrelation function or ACF. The plot is also known as a correlogram.

Figure 2.10: Autocorrelation function of quarterly beer production

In this graph:

  • $r_{4}$ is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be four quarters apart and the troughs tend to be four quarters apart.
  • $r_{2}$ is more negative than for the other lags because troughs tend to be two quarters behind peaks.

White noise

Time series that show no autocorrelation are called "white noise". Figure 2.11 gives an example of a white noise series.

Figure 2.11: A white noise time series.

R code
x <- ts(rnorm(50))
plot(x, main="White noise")

Figure 2.12: Autocorrelation function for the white noise series.

For white noise series, we expect each autocorrelation to be close to zero. Of course, they are not exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within $\pm 2/\sqrt{T}$ where $T$ is the length of the time series. It is common to plot these bounds on a graph of the ACF. If there are one or more large spikes outside these bounds, or if more than 5% of spikes are outside these bounds, then the series is probably not white noise.

In this example, $T=50$ and so the bounds are at $\pm 2/\sqrt{50} = \pm 0.28$. All autocorrelation coefficients lie within these limits, which is consistent with the data being white noise.
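
The check described above can be sketched in base R. The seed, the number of lags (15) and the variable names are arbitrary choices made for this illustration, so the simulated series is not the one shown in Figure 2.11.

```r
set.seed(123)                   # arbitrary seed so the sketch is reproducible
n_obs <- 50                     # length of the series (T in the text)
x <- ts(rnorm(n_obs))           # simulated white noise

# Autocorrelations at lags 1 to 15 (lag 0 is dropped)
r <- as.numeric(acf(x, lag.max = 15, plot = FALSE)$acf[-1])

# Approximate 95% bounds for a white noise series
bound <- 2 / sqrt(n_obs)
print(round(bound, 2))

# Number and proportion of spikes falling outside the bounds
outside <- sum(abs(r) > bound)
print(outside)
print(outside / length(r))
```

For a white noise series, about 5% of the spikes are expected to fall outside the bounds, so an occasional small exceedance is not by itself evidence against white noise.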

  1. The $\sum$ indicates that the values of $x_{i}$ are to be summed from $i=1$ to $i=N$.