2.2 Numerical data summaries
Numerical summaries of data sets are widely used to capture some essential features of the data with a few numbers. A summary number calculated from the data is called a statistic.
Univariate statistics
For a single data set, the most widely used statistics are the average and median.
Suppose $N$ denotes the total number of observations and $x_i$ denotes the $i$th observation. Then the average can be written as^{1}
[ \bar{x} = \frac{1}{N}\sum_{i=1}^N x_{i} = (x_{1} + x_{2} + x_3 + \cdots + x_{N})/N . ]
The average is also called the sample mean.
By way of illustration, consider the carbon footprint from the 20 vehicles listed in Section 1/4. The data listed in order are
4.0  4.4  5.9  5.9  6.1  6.1  6.1  6.3  6.3  6.3 
6.6  6.6  6.6  6.6  6.6  6.6  6.6  6.8  6.8  6.8 
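In this example, $N=20$ and $x_{i}$ denotes the carbon footprint of vehicle $i$. Then the average carbon footprint is
[ \bar{x} = \frac{1}{20}\sum_{i=1}^{20} x_{i} = (4.0 + 4.4 + 5.9 + \cdots + 6.8 + 6.8)/20 = 124/20 = 6.2 . ]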
The median, on the other hand, is the middle observation when the data are placed in order. In this case, there are 20 observations and so the median is the average of the 10th and 11th largest observations. That is
[ \text{median} = (6.3+6.6)/2 = 6.45.]
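The average and median for this example can be checked in R. This is a minimal sketch that types in the 20 listed values by hand; the vector name `carbon` is an assumption made for illustration and is not an object from the original data set.

```r
# Carbon footprints of the 20 vehicles, typed in from the list above
# (the name `carbon` is illustrative)
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

N <- length(carbon)        # number of observations, 20
xbar <- sum(carbon) / N    # average computed from the formula
xbar                       # 6.2, the same value mean(carbon) gives

# With an even number of observations, the median is the average
# of the two middle values of the sorted data
sorted <- sort(carbon)
med <- (sorted[N / 2] + sorted[N / 2 + 1]) / 2
med                        # 6.45, the same value median(carbon) gives
```

In practice the built-in `mean()` and `median()` functions would be used directly; the long-hand versions are shown only to mirror the definitions.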
Percentiles are useful for describing the distribution of data. For example, 90% of the data are no larger than the 90th percentile. In the carbon footprint example, the 90th percentile is 6.8 because 6.8 is the smallest value for which at least 90% of the data (18 observations) are less than or equal to it. Similarly, the 75th percentile is 6.6 and the 25th percentile is 6.1. The median is the 50th percentile.
A useful measure of how spread out the data are is the interquartile range or IQR. This is simply the difference between the 75th and 25th percentiles. Thus it contains the middle 50% of the data. For the example,
[ \text{IQR} = 6.6 - 6.1 = 0.5. ]
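The percentiles and the IQR can be reproduced with R's `quantile()` function. This is a sketch under two assumptions: the vector `carbon` is typed in by hand for illustration, and the default percentile definition of `quantile()` (type 7) is used. Other definitions can give slightly different values for small data sets, although here the default agrees with the numbers quoted in the text.

```r
# Carbon footprints of the 20 vehicles (illustrative vector name)
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

# 25th, 50th, 75th and 90th percentiles (default definition, type 7)
q <- quantile(carbon, probs = c(0.25, 0.50, 0.75, 0.90))
q                                   # 6.10, 6.45, 6.60, 6.80

# Interquartile range: 75th percentile minus 25th percentile
iqr <- q[["75%"]] - q[["25%"]]
iqr                                 # 0.5, the same value IQR(carbon) gives
```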
An alternative and more common measure of spread is the standard deviation. This is given by the formula
[ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_{i} - \bar{x})^2}. ]
In the example, the standard deviation is
[ s = \sqrt{\frac{1}{19} \left[ (4.0-6.2)^2 + (4.4-6.2)^2 + \cdots + (6.8-6.2)^2\right]} = 0.74. ]
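# Minimum, quartiles, median, mean and maximum of the carbon footprint
summary(fuel2[,"Carbon"])
# Sample standard deviation of the carbon footprint
sd(fuel2[,"Carbon"])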
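To connect the standard deviation formula to the built-in `sd()` function, the following sketch evaluates the formula term by term. The vector `carbon` is again an illustrative, hand-typed copy of the 20 values rather than the `fuel2` object used above.

```r
# Carbon footprints of the 20 vehicles (illustrative vector name)
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

N <- length(carbon)
xbar <- mean(carbon)

# Sum of squared deviations from the mean, divided by N - 1,
# then square-rooted
s <- sqrt(sum((carbon - xbar)^2) / (N - 1))
round(s, 2)                # 0.74

# The built-in function uses the same N - 1 divisor
all.equal(s, sd(carbon))   # TRUE
```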
Bivariate statistics
The most commonly used bivariate statistic is the correlation coefficient. It measures the strength of the relationship between two variables and can be written as
[ r = \frac{\sum (x_{i} - \bar{x})(y_{i}-\bar{y})}{\sqrt{\sum(x_{i}-\bar{x})^2}\sqrt{\sum(y_{i}-\bar{y})^2}}, ]
where the first variable is denoted by $x$ and the second variable by $y$. The correlation coefficient only measures the strength of the linear relationship; it is possible for two variables to have a strong nonlinear relationship but a low correlation coefficient. The value of $r$ always lies between $-1$ and $1$, with negative values indicating a negative relationship and positive values indicating a positive relationship.
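The following sketch evaluates the correlation formula directly and illustrates the point about nonlinear relationships. The function name `corr` and the data are hypothetical, chosen only for illustration; R's built-in `cor()` computes the same quantity.

```r
# Correlation coefficient computed directly from the formula
corr <- function(x, y) {
  dx <- x - mean(x)   # deviations of x from its mean
  dy <- y - mean(y)   # deviations of y from its mean
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

x <- -5:5             # hypothetical data, symmetric about zero
y_lin <- 2 * x + 1    # exact linear relationship
y_sq <- x^2           # exact but nonlinear relationship

corr(x, y_lin)        # 1
corr(x, y_sq)         # 0, although y_sq is completely determined by x
```

The second result shows why a correlation near zero does not by itself mean the two variables are unrelated.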
For example, the correlation between the carbon footprint and city mpg variables shown in Figure 2.5 is $-0.97$. The value is negative because the carbon footprint decreases as the city mpg increases. While a value of $-0.97$ is very high, the relationship is even stronger than that number suggests due to its nonlinear nature.
The graphs in Figure 2.7 show examples of data sets with varying levels of correlation. Those in Figure 2.8 all have correlation coefficients of 0.82, but they have very different shaped relationships. This shows how important it is not to rely only on correlation coefficients but also to look at the plots of the data.
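A similar effect can be reproduced with the `anscombe` data frame that ships with R's `datasets` package. Whether these are the data behind Figure 2.8 is an assumption, not something stated here; the sketch only shows that four differently shaped data sets can share almost the same correlation coefficient.

```r
# anscombe is a built-in data frame with columns x1..x4 and y1..y4,
# forming four separate (x, y) data sets
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(r, 2)   # 0.82 0.82 0.82 0.82

# Plot the four data sets side by side to compare their shapes
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)       # restore the previous plotting layout
```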
Autocorrelation
Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series. There are several autocorrelation coefficients, depending on the lag length. For example, $r_{1}$ measures the relationship between $y_{t}$ and $y_{t-1}$, $r_{2}$ measures the relationship between $y_{t}$ and $y_{t-2}$, and so on.
Figure 2.9 displays scatterplots of the beer production time series where the horizontal axis shows lagged values of the time series. Each graph shows $y_{t}$ plotted against $y_{t-k}$ for different values of $k$. The autocorrelations are the correlations associated with these scatterplots.
lag.plot(beer2, lags=9, do.lines=FALSE)
The value of $r_{k}$ can be written as
[ r_{k} = \frac{\sum\limits_{t=k+1}^T (y_{t}-\bar{y})(y_{t-k}-\bar{y})}{\sum\limits_{t=1}^T (y_{t}-\bar{y})^2}, ]
where $T$ is the length of the time series.
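The formula for $r_{k}$ can be evaluated directly and compared with R's built-in `acf()` function. In this sketch the function name `autocorr` and the short series `y` are hypothetical; the variable is called `T_len` rather than `T` because `T` is a shorthand for `TRUE` in R.

```r
# r_k computed directly from the formula
autocorr <- function(y, k) {
  T_len <- length(y)
  ybar <- mean(y)
  # Products of deviations for y_t (t = k+1..T) and y_{t-k} (t-k = 1..T-k)
  num <- sum((y[(k + 1):T_len] - ybar) * (y[1:(T_len - k)] - ybar))
  den <- sum((y - ybar)^2)
  num / den
}

y <- c(5, 7, 6, 9, 8, 11, 10, 13, 12, 15)   # hypothetical series

r_manual <- sapply(1:3, function(k) autocorr(y, k))

# acf() returns lag 0 as its first element, so lags 1..3 are elements 2..4
r_acf <- acf(y, lag.max = 3, plot = FALSE)$acf[2:4]

all.equal(r_manual, r_acf)   # TRUE
```

Note that the denominator uses all $T$ observations while the numerator has only $T-k$ terms, so $r_{k}$ is not exactly the ordinary correlation between $y_{t}$ and $y_{t-k}$.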
The first nine autocorrelation coefficients for the beer production data are given in the following table.
| $r_{1}$ | $r_{2}$ | $r_{3}$ | $r_{4}$ | $r_{5}$ | $r_{6}$ | $r_{7}$ | $r_{8}$ | $r_{9}$ |
|---|---|---|---|---|---|---|---|---|
| -0.126 | -0.650 | -0.094 | 0.863 | -0.099 | -0.642 | -0.098 | 0.834 | -0.116 |
These correspond to the nine scatterplots in the graph above. The autocorrelation coefficients are normally plotted to form the autocorrelation function or ACF. The plot is also known as a correlogram.
In this graph:
 $r_{4}$ is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be four quarters apart and the troughs also tend to be four quarters apart.
 $r_{2}$ is more negative than for the other lags because troughs tend to be two quarters behind peaks.
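The same two features can be seen in a simple artificial example. The series below is hypothetical and is not the beer production data: it repeats a fixed quarterly pattern with a peak in the first quarter and a trough in the third, so peaks are four quarters apart and troughs are two quarters behind peaks.

```r
# Hypothetical quarterly pattern (peak in Q1, trough in Q3), 10 years
y <- rep(c(10, 5, 0, 5), times = 10)

# Autocorrelations at lags 1..8 (the first element of $acf is lag 0)
r <- acf(y, lag.max = 8, plot = FALSE)$acf[-1]
round(r, 3)

r[4]           # 0.9: the largest value, at the seasonal lag
r[2]           # -0.95: the most negative value, half a season apart

# acf(y) without plot = FALSE would draw the correlogram
```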
White noise
Time series that show no autocorrelation are called "white noise". Figure 2.11 gives an example of a white noise series.
# Simulate 50 independent standard normal values as a time series
x <- ts(rnorm(50))
plot(x, main="White noise")
For white noise series, we expect each autocorrelation to be close to zero. Of course, they are not exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within $\pm 2/\sqrt{T}$ where $T$ is the length of the time series. It is common to plot these bounds on a graph of the ACF. If there are one or more large spikes outside these bounds, or if more than 5% of spikes are outside these bounds, then the series is probably not white noise.
In this example, $T=50$ and so the bounds are at $\pm 2/\sqrt{50} = \pm 0.28$. All autocorrelation coefficients lie within these limits, which is consistent with the data being white noise.
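The check described above can be sketched in a few lines of R. The seed value is arbitrary and is set only so that the sketch is reproducible; it is an assumption and need not match the series shown in Figure 2.11. The name `T_len` is used instead of `T` because `T` is a shorthand for `TRUE` in R.

```r
set.seed(1)                  # arbitrary seed (assumption), for reproducibility
T_len <- 50
x <- ts(rnorm(T_len))        # simulated white noise series

bound <- 2 / sqrt(T_len)     # about 0.28

# Autocorrelations at the default lags, dropping lag 0
r <- acf(x, plot = FALSE)$acf[-1]

# Number and proportion of spikes outside the bounds
outside <- sum(abs(r) > bound)
outside
outside / length(r)
```

The dashed lines drawn by R's own `acf()` plot use 1.96 in place of 2, so they sit very slightly inside the bounds computed here.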

1. The $\sum$ indicates that the values of $x_{i}$ are to be summed from $i=1$ to $i=N$.