5.4 Residual diagnostics

The residuals from a regression model are calculated as the difference between the actual values and the fitted values: $e_{i} = y_{i}-\hat{y}_{i}$. Each residual is the unpredictable component of the associated observation.
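
In R, the residuals can be obtained directly from a fitted model object rather than computed by hand. The short sketch below is illustrative only: fit stands for a fitted lm() object and y for the observed values of the forecast variable; neither object has been defined at this point.

R code
# Illustrative sketch: residuals computed from the definition ...
e <- y - fitted(fit)
# ... or extracted with the residuals() function, as in the code below.
residuals(fit)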

After selecting the regression variables and fitting a regression model, it is necessary to plot the residuals to check that the assumptions of the model have been satisfied. Several plots should be produced to check different aspects of the fitted model and the underlying assumptions.

Scatterplots of residuals against predictors

Do a scatterplot of the residuals against each predictor in the model. If these scatterplots show a pattern, then the relationship may be nonlinear and the model will need to be modified accordingly. See Section 5/6 for a discussion of nonlinear regression.

Figure 5.8: The residuals from the regression model for credit scores plotted against each of the predictors in the model.

R code
# Fit the regression model for credit scores and plot the residuals
# against each of the four (log-transformed) predictors.
fit <- lm(score ~ log.savings + log.income +
 log.address + log.employed, data=creditlog)
par(mfrow=c(2,2))
plot(creditlog$log.savings, residuals(fit), xlab="log(savings)")
plot(creditlog$log.income, residuals(fit), xlab="log(income)")
plot(creditlog$log.address, residuals(fit), xlab="log(address)")
plot(creditlog$log.employed, residuals(fit), xlab="log(employed)")

It is also necessary to plot the residuals against any predictors not in the model. If these show a pattern, then the predictor may need to be added to the model (possibly in a nonlinear form).
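
For example, if the credit data contained a predictor that had been left out of the model, we could plot the residuals against it in the same way as above. The variable name used below is hypothetical and only illustrates the idea.

R code
# Sketch only: log.debt is a hypothetical predictor not in the model.
plot(creditlog$log.debt, residuals(fit),
 xlab="log(debt)", ylab="Residuals")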

Figure 5.8 shows the residuals from the model fitted to credit scores. In this case, the scatterplots show no obvious patterns, although the residuals tend to be negative for large values of the savings predictor. This suggests that the credit scores tend to be overestimated for people with large amounts of savings. To correct this bias, we would need to use a nonlinear model (see Section 5/6).

Scatterplot of residuals against fitted values

A plot of the residuals against the fitted values should show no pattern. If a pattern is observed, there may be "heteroscedasticity" in the errors. That is, the variance of the residuals may not be constant. To overcome this problem, a transformation of the forecast variable (such as a logarithm or square root) may be required.
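
For example, if the residual variance increased with the level of the fitted values, the model could be re-estimated with a transformed forecast variable. The sketch below uses a log transformation (assuming all scores are strictly positive); it is illustrative only and not a change we make to the credit score model.

R code
# Illustrative sketch: refit with a log-transformed forecast variable
# to stabilise the residual variance (assumes score > 0 throughout).
fit2 <- lm(log(score) ~ log.savings + log.income +
 log.address + log.employed, data=creditlog)
plot(fitted(fit2), residuals(fit2),
 xlab="Fitted values", ylab="Residuals")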

The following graph shows a plot of the residuals against the fitted values for the credit score model.

Figure 5.9: The residuals from the credit score model plotted against the fitted values obtained from the model.

R code
plot(fitted(fit), residuals(fit),
 xlab="Predicted scores", ylab="Residuals")

Again, the plot shows no systematic patterns and the variation in the residuals does not seem to change with the size of the fitted value.

Autocorrelation in the residuals

When the data are a time series, you should look at an ACF plot of the residuals. This will reveal if there is any autocorrelation in the residuals (suggesting that there is information that has not been accounted for in the model).

The following figure shows a time plot and ACF of the residuals from the model fitted to the beer production data discussed in Section 5/2.

Figure 5.10: Residuals from the regression model for beer production.

R code
fit <- tslm(beer2 ~ trend + season)  # linear trend and seasonal dummy variables
res <- residuals(fit)
par(mfrow=c(1,2))
plot(res, ylab="Residuals", xlab="Year")  # time plot of the residuals
Acf(res, main="ACF of residuals")

There is an outlier in the residuals (2004:Q4) which suggests there was something unusual happening in that quarter. It would be worth investigating that outlier to see if there were any unusual circumstances or events that may have reduced beer production for the quarter.
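
One quick way to locate the offending observation (a sketch using base R functions, not part of the original analysis) is to find the most extreme residual and the quarter in which it occurred.

R code
# Find the most extreme residual and when it occurred.
i <- which.max(abs(res))
time(res)[i]  # decimal date; 2004.75 would correspond to 2004:Q4
res[i]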

The remaining residuals show that the model has captured the patterns in the data quite well, although there is a small amount of autocorrelation left in the residuals (seen in the significant spike in the ACF plot). This suggests that the model can be slightly improved, although it is unlikely to make much difference to the resulting forecasts.

Another test of autocorrelation that is designed to take account of the regression model is the Durbin-Watson test. It is used to test the hypothesis that there is no lag one autocorrelation in the residuals. If there is no autocorrelation, the Durbin-Watson distribution is symmetric around 2; values substantially below 2 suggest positive autocorrelation, while values substantially above 2 suggest negative autocorrelation. Most computer packages will report the DW statistic automatically, and should also provide a p-value. A small p-value indicates that there is significant autocorrelation remaining in the residuals. For the beer model, the Durbin-Watson test reveals some significant lag one autocorrelation.

R code
dwtest(fit, alt="two.sided")
# It is recommended that the two-sided test always be used
# to check for negative as well as positive autocorrelation
R output
    Durbin-Watson test
DW = 2.5951, p-value = 0.02764

Both the ACF plot and the Durbin-Watson test show that there is some autocorrelation remaining in the residuals. This means there is some information remaining in the residuals that can be exploited to obtain better forecasts. The forecasts from the current model are still unbiased, but will have larger prediction intervals than necessary. A better model in this case would be a dynamic regression model, which will be covered in Chapter 9.

A third possible test is the Breusch-Godfrey test, which is designed to look for significant autocorrelation up to a specified lag order.

R code
# Test for autocorrelations up to lag 5.
bgtest(fit,5)

Histogram of residuals

Finally, it is a good idea to check if the residuals are normally distributed. As explained earlier, this is not essential for forecasting, but it does make the calculation of prediction intervals much easier.

Figure 5.11: Histogram of residuals from regression model for beer production.

R code
hist(res, breaks="FD", xlab="Residuals",
 main="Histogram of residuals", ylim=c(0,22))
# Overlay a normal density with mean zero and the same standard
# deviation as the residuals, scaled (by 560) to match the counts.
x <- -50:50
lines(x, 560*dnorm(x, 0, sd(res)), col=2)

In this case, the residuals seem to be slightly negatively skewed, although that is probably due to the outlier.
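
As a rough check (a sketch, not part of the original analysis), the sample skewness of the residuals can be compared with and without the most extreme residual; removing it should bring the skewness closer to zero.

R code
# Moment-based sample skewness, with and without the most extreme residual.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew(res)
skew(res[-which.max(abs(res))])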