2.6 Residual diagnostics

A residual in forecasting is the difference between an observed value and its forecast based on other observations: $e_{i} = y_{i}-\hat{y}_{i}$. For time series forecasting, a residual is based on one-step forecasts; that is, $\hat{y}_{t}$ is the forecast of $y_{t}$ based on the observations $y_{1},\dots,y_{t-1}$. For cross-sectional forecasting, a residual is based on all other observations; that is, $\hat{y}_i$ is the prediction of $y_i$ based on all observations except $y_i$.

A good forecasting method will yield residuals with the following properties:

  • The residuals are uncorrelated. If there are correlations between residuals, then there is information left in the residuals which should be used in computing forecasts.
  • The residuals have zero mean. If the residuals have a mean other than zero, then the forecasts are biased.

Any forecasting method that does not satisfy these properties can be improved. That does not mean that forecasting methods satisfying these properties cannot be improved. It is possible to have several forecasting methods for the same data set, all of which satisfy these properties. Checking these properties is important to see whether a method is using all available information well, but it is not a good way to select a forecasting method.

If either of these two properties is not satisfied, then the forecasting method can be modified to give better forecasts. Adjusting for bias is easy: if the residuals have mean $m$, then simply add $m$ to all forecasts and the bias problem is solved. Fixing the correlation problem is harder and we will not address it until Chapter 8.
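
To illustrate the bias adjustment, here is a minimal sketch in R. The series `y` is hypothetical (a random walk with drift, which the naïve method forecasts with a biased one-step forecast); `naive()` and `residuals()` are from the forecast package.

R code
# A minimal sketch of a bias adjustment; the series y is hypothetical.
library(forecast)
y  <- ts(cumsum(rnorm(100, mean=0.5)))  # random walk with upward drift
fc <- naive(y, h=10)                    # naive forecasts ignore the drift
m  <- mean(residuals(fc), na.rm=TRUE)   # mean of the residuals (the bias)
adjusted <- fc$mean + m                 # add m to all forecasts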

In addition to these essential properties, it is useful (but not necessary) for the residuals to also have the following two properties.

  • The residuals have constant variance.
  • The residuals are normally distributed.

These two properties make the calculation of prediction intervals easier (see the next section for an example). However, a forecasting method that does not satisfy these properties cannot necessarily be improved. Sometimes applying a transformation such as a logarithm or a square root may assist with these properties, but otherwise there is usually little you can do to ensure that your residuals have constant variance and a normal distribution. Instead, an alternative approach to obtaining prediction intervals is necessary. Again, we will not address how to do this until later in the book.
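
For example, a log transformation can be applied before forecasting, with the point forecasts back-transformed afterwards. The following is a sketch only, assuming a hypothetical positive-valued series `y` and the forecast package; `BoxCox.lambda()` and `BoxCox()` offer a more general family of transformations.

R code
# Sketch: transforming before forecasting; y is hypothetical and positive.
library(forecast)
y      <- ts(exp(cumsum(rnorm(100, sd=0.05))))
fc.log <- naive(log(y), h=10)        # forecast on the log scale
fc     <- exp(fc.log$mean)           # back-transform the point forecasts
lambda <- BoxCox.lambda(y)           # or choose a Box-Cox transformation
fc.bc  <- naive(BoxCox(y, lambda), h=10)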

Example: Forecasting the Dow-Jones Index

For stock market indexes, the best forecasting method is often the naïve method. That is, each forecast is simply equal to the last observed value, or $\hat{y}_{t} = y_{t-1}$. Hence, the residuals are simply equal to the difference between consecutive observations: $$ e_{t} = y_{t} - \hat{y}_{t} = y_{t} - y_{t-1}. $$
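
A quick check in R confirms this: the residuals returned by `naive()` are the first differences of the series, with the first residual missing because there is no earlier observation. This sketch uses a hypothetical series; the Dow-Jones code follows below.

R code
# Sketch: naive residuals equal first differences of the series.
library(forecast)
y <- ts(rnorm(50))                   # hypothetical series
e <- residuals(naive(y))             # first value is NA
all.equal(as.numeric(e[-1]), as.numeric(diff(y)))  # TRUE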

The following graphs show the Dow Jones Index (DJI), and the residuals obtained from forecasting the DJI with the naïve method.

Figure 2.19: The Dow Jones Index measured daily to 15 July 1994.

Figure 2.20: Residuals from forecasting the Dow Jones Index with the naïve method.

Figure 2.21: Histogram of the residuals from the naïve method applied to the Dow Jones Index. The left tail is a little too long for a normal distribution.

Figure 2.22: ACF of the residuals from the naïve method applied to the Dow Jones Index. The lack of correlation suggests the forecasts are good.

R code
library(fpp)   # provides the dj data, along with naive() and Acf()
dj2 <- window(dj, end=250)   # Dow Jones Index, daily to 15 July 1994
plot(dj2, main="Dow Jones Index (daily ending 15 Jul 94)",
  ylab="", xlab="Day")
res <- residuals(naive(dj2))   # one-step residuals from the naive method
plot(res, main="Residuals from naive method",
  ylab="", xlab="Day")
Acf(res, main="ACF of residuals")
hist(res, nclass="FD", main="Histogram of residuals")

These graphs show that the naïve method produces forecasts that appear to account for all available information. The mean of the residuals is very close to zero, and there is no significant correlation in the residual series. The time plot of the residuals shows that the variation of the residuals stays much the same across the historical data, so the residual variance can be treated as constant. However, the histogram suggests that the residuals may not follow a normal distribution: the left tail looks a little too long. Consequently, forecasts from this method will probably be quite good, but prediction intervals computed assuming a normal distribution may be inaccurate.
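
These visual impressions can be backed up numerically. The following sketch uses `res` from the code above; the Shapiro-Wilk test is an optional extra check of normality, not one used in the book itself.

R code
# Numerical checks on the residuals from the code above.
mean(res, na.rm=TRUE)        # close to zero: forecasts are almost unbiased
shapiro.test(na.omit(res))   # a small p-value would suggest non-normality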

Portmanteau tests for autocorrelation

In addition to looking at the ACF plot, we can carry out a more formal test for autocorrelation by considering a whole set of $r_k$ values as a group, rather than treating each one separately.

Recall that $r_k$ is the autocorrelation for lag $k$. When we look at the ACF plot to see if each spike is within the required limits, we are implicitly carrying out multiple hypothesis tests, each one with a small probability of giving a false positive. When enough of these tests are done, it is likely that at least one will give a false positive and so we may conclude that the residuals have some remaining autocorrelation, when in fact they do not.

In order to overcome this problem, we test whether the first $h$ autocorrelations are significantly different from what would be expected from a white noise process. A test for a group of autocorrelations is called a portmanteau test, from a French word describing a suitcase containing a number of items.

One such test is the Box-Pierce test, based on the following statistic: $$Q = T \sum_{k=1}^h r_k^2, $$ where $h$ is the maximum lag being considered and $T$ is the number of observations. If each $r_k$ is close to zero, then $Q$ will be small. If some $r_k$ values are large (positive or negative), then $Q$ will be large. We suggest using $h=10$ for non-seasonal data and $h=2m$ for seasonal data, where $m$ is the period of seasonality. However, the test is not good when $h$ is large, so if these values are larger than $T/5$, then use $h=T/5$.
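
The statistic is straightforward to compute directly. Here is a sketch using the residuals `res` from the Dow-Jones code above and $h=10$.

R code
# Sketch: computing the Box-Pierce statistic by hand.
e <- na.omit(res)                             # drop the leading NA
T <- length(e)
h <- 10
r <- acf(e, lag.max=h, plot=FALSE)$acf[-1]    # r_1, ..., r_h (lag 0 dropped)
Q <- T * sum(r^2)                             # compare with Box.test() below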

A related (and more accurate) test is the Ljung-Box test, based on $$Q^* = T(T+2) \sum_{k=1}^h (T-k)^{-1}r_k^2. $$

Again, large values of $Q^*$ suggest that the autocorrelations do not come from a white noise series.

How large is too large? If the autocorrelations did come from a white noise series, then both $Q$ and $Q^*$ would have a $\chi^2$ distribution with $(h - K)$ degrees of freedom where $K$ is the number of parameters in the model. If they are calculated from raw data (rather than the residuals from a model), then set $K=0$.
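
Continuing the sketch above, both statistics and their p-values can be computed from the same autocorrelations (here with $K=0$, as appropriate for the naïve method).

R code
# Sketch: Ljung-Box statistic and p-values, continuing from above.
K     <- 0                                   # naive method has no parameters
Qstar <- T*(T+2) * sum(r^2 / (T - (1:h)))    # Ljung-Box statistic
1 - pchisq(Q,     df=h-K)                    # Box-Pierce p-value
1 - pchisq(Qstar, df=h-K)                    # Ljung-Box p-value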

For the Dow-Jones example, the naïve method has no parameters, so $K=0$ in that case also. For both $Q$ and $Q^*$, the results are not significant (i.e., the p-values are relatively large). So we can conclude that the residuals are not distinguishable from a white noise series.

R output
# lag=h and fitdf=K
> Box.test(res, lag=10, fitdf=0)
        Box-Pierce test
X-squared = 10.6425, df = 10, p-value = 0.385

> Box.test(res,lag=10, fitdf=0, type="Lj")
        Box-Ljung test
X-squared = 11.0729, df = 10, p-value = 0.3507