4.4 Evaluating the regression model

Residual plots

Recall that each residual is the unpredictable random component of each observation and is defined as $$e_i=y_i-\hat{y}_i, $$ for $i=1,\ldots,N$. We would expect the residuals to be randomly scattered, without showing any systematic patterns. A simple and quick first check is to examine a scatterplot of the residuals against the predictor variable.

Figure 4.4: Residual plot from regressing carbon footprint versus fuel economy in city driving conditions.

R code
# Extract the residuals from the fitted regression
res <- residuals(fit)
# Plot residuals against the predictor; jitter() reduces overplotting
plot(jitter(res)~jitter(City), ylab="Residuals", xlab="City", data=fuel)
# Horizontal reference line at zero
abline(0,0)

A non-random pattern may indicate that a non-linear relationship is required, that some heteroscedasticity is present (i.e., the residuals show non-constant variance), or that there is some left-over serial correlation (relevant only when the data are time series).

Figure 4.4 shows that the residuals from the Car data example display a pattern rather than being randomly scattered. Residuals corresponding to the lowest City values are mainly positive; for City values between 20 and 30 the residuals are mainly negative; and for larger City values (above 30 mpg) the residuals are positive again. This suggests that the simple linear model is not appropriate for these data. Instead, a non-linear model will be required.
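As a rough illustration of one possible remedy (a sketch only; the choice of functional form is a modelling decision in its own right), a log-log specification can be fitted and its residuals checked in the same way. This assumes the fuel data frame contains variables named Carbon and City, consistent with the earlier code.

R code
# One candidate non-linear form: regress log(Carbon) on log(City)
fit2 <- lm(log(Carbon) ~ log(City), data=fuel)
res2 <- residuals(fit2)
# Re-examine the residual plot for any remaining systematic pattern
plot(jitter(res2)~jitter(City), ylab="Residuals", xlab="City", data=fuel)
abline(0,0)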

Outliers and influential observations

Observations that take on extreme values compared to the majority of the data are called “outliers”. Observations that have a large influence on the estimation results of a regression model are called “influential observations”. Usually, influential observations are also outliers that are extreme in the $x$ direction.

Example 4.2 Predicting weight from height

In Figure 4.5 we consider simple linear models for predicting the weight of 7-year-old children by regressing weight against height. The two samples considered are identical except for one observation, which is an outlier. In the first row of plots the outlier is a child who weighs 35kg and is 120cm tall. In the second row of plots the outlier is a child who also weighs 35kg but is much taller at 150cm (so more extreme in the $x$ direction). The black lines are the estimated regression lines when the outlier is excluded from the sample in each case. The red lines are the estimated regression lines when the outlier is included. Both outliers have an effect on the regression line, but the second has a much bigger effect, so we call it an influential observation. The residual plots show that an influential observation does not always lead to a large residual.

Figure 4.5: The effect of outliers and influential observations on regression.
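To see the mechanics behind Figure 4.5, the short sketch below fits a regression with and without a single observation that is extreme in the $x$ direction and compares the estimated coefficients. The heights, weights and added outlier are made-up values for illustration only, not the data plotted in the figure.

R code
# Hypothetical data: heights (cm) and weights (kg) of 7-year-old children
height <- c(118, 120, 121, 123, 124, 126, 127, 129)
weight <- c(21, 22, 23, 23, 24, 25, 26, 27)
fit_clean <- lm(weight ~ height)

# Add one outlier that is extreme in the x direction (150cm, 35kg)
height2 <- c(height, 150)
weight2 <- c(weight, 35)
fit_outlier <- lm(weight2 ~ height2)

# Compare the estimated intercepts and slopes:
# the extreme-x observation pulls the fitted line towards itself
coef(fit_clean)
coef(fit_outlier)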

There are formal methods for detecting outliers and influential observations that are beyond the scope of this textbook. As we suggested at the beginning of Chapter 2, getting familiar with your data prior to performing any analysis is of vital importance. A scatter plot of $y$ against $x$ is always a useful starting point in regression analysis and often helps to identify unusual observations.

One source of outliers is incorrect data entry. Simple descriptive statistics of your data can identify minima and maxima that are not sensible. If such an observation is identified, and it has been recorded incorrectly, it should be removed from the sample immediately.
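For example, a quick look at the minimum and maximum of each variable will often reveal values that cannot be correct, such as a negative fuel economy. The sketch below assumes the fuel data frame and City variable from the earlier code.

R code
# Minima, maxima, quartiles and means of every variable in the data frame
summary(fuel)
# Inspect the rows holding the most extreme values of the predictor
fuel[which.min(fuel$City), ]
fuel[which.max(fuel$City), ]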

Outliers also occur when some observations are simply different. In this case it may not be wise to remove these observations. If an observation has been identified as a likely outlier, it is important to study it and analyze the possible reasons behind it. The decision to remove or retain such an observation can be a challenging one (especially when outliers are influential observations). It is wise to report results both with and without the removal of such observations.

Goodness-of-fit

A common way to summarize how well a linear regression model fits the data is via the coefficient of determination, or $R^2$. This can be calculated as the square of the correlation between the observed $y$ values and the predicted $\hat{y}$ values. Alternatively, it can be calculated as:

$$ R^2 = \frac{\sum(\hat{y}_{i} - \bar{y})^2}{\sum(y_{i}-\bar{y})^2}, $$

where the summations are over all observations. Thus, it is also the proportion of variation in the forecast variable that is accounted for (or explained) by the regression model.

If the predictions are close to the actual values, we would expect $R^2$ to be close to 1. On the other hand, if the predictions are unrelated to the actual values, then $R^2=0$. In all cases, $R^2$ lies between 0 and 1.

In simple linear regression, the value of $R^2$ is also equal to the square of the correlation between $y$ and $x$. In the car data example $r=-0.91$, hence $R^2=0.82$. The coefficient of determination is presented as part of the R output obtained when estimating a linear regression, and is labelled “Multiple R-squared: 0.8244” in Figure 4.3. Thus, 82% of the variation in the carbon footprint of cars is captured by the model. However, a “high” $R^2$ does not always indicate a good model for forecasting. Figure 4.4 shows that there are specific ranges of values of $y$ for which the fitted $y$ values are systematically under- or over-estimated.
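The value reported in the R output can be reproduced directly from the definitions above. The sketch below assumes fit and fuel are the objects from the earlier code, with Carbon as the forecast variable (an assumed column name).

R code
# R-squared as the squared correlation between observed and fitted values
cor(fuel$Carbon, fitted(fit))^2
# Equivalently, the proportion of variation explained by the regression
sum((fitted(fit) - mean(fuel$Carbon))^2) / sum((fuel$Carbon - mean(fuel$Carbon))^2)
# The value reported by summary() in Figure 4.3
summary(fit)$r.squared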

The $R^2$ value is commonly used, often incorrectly, in forecasting. There are no set rules for what constitutes a good $R^2$ value, and typical values of $R^2$ depend on the type of data used. Validating a model’s out-of-sample forecasting performance is much better than measuring the in-sample $R^2$ value.

Standard error of the regression

Another measure of how well the model has fitted the data is the standard deviation of the residuals, which is often known as the “standard error of the regression” and is calculated by \begin{equation}\label{eq-4-se}\tag{4.1} s_e=\sqrt{\frac{1}{N-2}\sum_{i=1}^{N}{e_i^2}}. \end{equation} Notice that this calculation is slightly different from the usual standard deviation where we divide by $N-1$ (see Section 2/2). Here, we divide by $N-2$ because we have estimated two parameters (the intercept and slope) in computing the residuals. Normally, we only need to estimate the mean (i.e., one parameter) when computing a standard deviation. The divisor is always $N$ minus the number of parameters estimated in the calculation.

The standard error is related to the size of the average error that the model produces. We can compare this error with the sample mean of $y$ or with the standard deviation of $y$ to gain some perspective on the accuracy of the model. In Figure 4.3, $s_e$ is part of the R output labelled “Residual standard error” and takes the value 0.4703 tonnes per year.
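Equation (4.1) can be verified directly from the residuals. The sketch below assumes fit is the fitted model from the earlier code.

R code
# Standard error of the regression, dividing by N - 2 as in Equation (4.1)
e <- residuals(fit)
N <- length(e)
sqrt(sum(e^2) / (N - 2))
# The value reported as "Residual standard error" in Figure 4.3
summary(fit)$sigma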

We should warn here that the evaluation of the standard error can be highly subjective as it is scale dependent. The main reason we introduce it here is that it is required when generating forecast intervals, discussed in Section 4/5.