5.1 Introduction to multiple linear regression

The general form of a multiple regression is [ y_{i} = \beta_{0} + \beta_{1} x_{1,i} + \beta_{2} x_{2,i} + \cdots + \beta_{k} x_{k,i} + e_{i}, ] where $y_{i}$ is the variable to be forecast and $x_{1,i},\dots,x_{k,i}$ are the $k$ predictor variables. Each of the predictor variables must be numerical. The coefficients $\beta_{1},\dots,\beta_{k}$ measure the effect of each predictor after taking account of the effect of all other predictors in the model. Thus, the coefficients measure the marginal effects of the predictor variables.
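
To see what these marginal effects mean in practice, here is a small simulation (a sketch only; the variables and coefficient values are invented and have nothing to do with the credit data). Two correlated predictors are generated, and the multiple regression recovers the effect of each predictor after accounting for the other, whereas a regression on $x_1$ alone also picks up part of the effect of $x_2$.

R code
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8*x1 + rnorm(n)            # x2 is correlated with x1
y  <- 2 + 3*x1 + 1*x2 + rnorm(n)   # true coefficients: beta1 = 3, beta2 = 1

coef(lm(y ~ x1 + x2))   # close to 3 and 1: the marginal effects
coef(lm(y ~ x1))        # noticeably larger than 3, because x1 stands in for x2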

As for simple linear regression, when forecasting we require the following assumptions for the errors $(e_{1},\dots,e_{N})$:

  • the errors have mean zero;
  • the errors are uncorrelated with each other;
  • the errors are uncorrelated with each predictor $x_{j,i}$.

It is also useful to have the errors normally distributed with constant variance in order to produce prediction intervals, but this is not necessary for forecasting.

Example: credit scores

All 500 observations of the credit score data are shown in the figure below.

Figure 5.2: Scatterplot matrix of the credit scores and the four predictors.

R code
# Scatterplot matrix of the credit score and the four predictors
# (columns 4 and 5 of the credit data are excluded from the plot).
pairs(credit[,-(4:5)], diag.panel=panel.hist)
# The panel.hist function is defined in help(pairs).

The top row shows the relationship between each predictor and credit score. While there appear to be some relationships here, particularly between credit score and income, the predictors are all highly skewed and the few outlying observations are making it hard to see what is going on in the bulk of the data.

A way around this problem is to take transformations of the predictors. However, we cannot take logarithms of the predictors because they contain some zeros. Instead, we use the transformation $\log(x+1)$ where $x$ is the predictor value. That is, we add one to the value of the predictor and then take logarithms. This has a similar effect to taking logarithms but avoids the problem of zeros. It also has the neat side-effect of zeros on the original scale remaining zeros on the transformed scale. The scatterplot matrix showing the transformed data is shown below.

Figure 5.3: Scatterplot matrix of the credit scores and the four predictors, after transforming the four predictors using logarithms.

R code
# Each predictor is transformed with log(x+1) to reduce the skewness;
# zeros on the original scale remain zeros on the transformed scale
# (log1p() would give the same result).
creditlog <- data.frame(score=credit$score,
  log.savings=log(credit$savings+1),
  log.income=log(credit$income+1),
  log.address=log(credit$time.address+1),
  log.employed=log(credit$time.employed+1),
  fte=credit$fte, single=credit$single)
# Scatterplot matrix of the score and the four transformed predictors
pairs(creditlog[,1:5], diag.panel=panel.hist)

Now the relationships between the variables are clearer, although there is a great deal of random variation present. We shall try fitting the following model: [ y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_3x_3 + \beta_4x_4 + e, ] where

\begin{align*} y & = \text{credit score}, \\ x_{1} & = \text{log savings}, \\ x_{2} & = \text{log income}, \\ x_{3} & = \text{log time at current address},\\ x_4 &= \text{log time in current job},\\ e & = \text{error}. \end{align*}

Here "$\log$" means the transformation $\log(x+1)$.

Estimation of the model

The values of the coefficients $\beta_{0},\dots,\beta_{k}$ are obtained by finding the minimum sum of squares of the errors. That is, we find the values of $\beta_{0},\dots,\beta_{k}$ which minimize [ \sum_{i=1}^N e_{i}^2 = \sum_{i=1}^N (y_{i} - \beta_{0} - \beta_{1}x_{1,i} - \cdots - \beta_{k} x_{k,i})^2. ] This is called "least squares" estimation because it gives the least value of the sum of squared errors. In practice, the calculation is always done using a computer package. Finding the best estimates of the coefficients is often called "fitting" the model to the data.
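
As a rough sketch of what such a package computes (using only base R matrix operations and the creditlog data frame constructed above; this is not how lm() is implemented internally), the least squares estimates solve the normal equations $(X'X)\hat{\beta} = X'y$, and they match the coefficients returned by lm():

R code
# Design matrix: a column of ones plus the four log-transformed predictors
X <- cbind(1, as.matrix(creditlog[, c("log.savings", "log.income",
                                      "log.address", "log.employed")]))
y <- creditlog$score

# Solve the normal equations (X'X) beta = X'y
beta.hat <- solve(t(X) %*% X, t(X) %*% y)

# The same estimates as those produced by lm()
cbind(beta.hat,
      coef(lm(score ~ log.savings + log.income + log.address + log.employed,
              data=creditlog)))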

When we refer to the estimated coefficients, we will use the notation $\hat\beta_{0},\dots,\hat\beta_{k}$. The equations for these will be given in Section 5/5.

Example: credit scores (continued)

The following computer output is obtained after fitting the regression model described earlier.

R code
# Fit the regression and use step() to drop any predictors that do not
# improve the AIC; the variable 'single' is dropped at this stage.
fit <- step(lm(score ~ log.savings + log.income + log.address
  + log.employed + single, data=creditlog))
summary(fit)
R output
Coefficients:
             Estimate Std. Error t value  P value
(Intercept)    -0.219      5.231   -0.04   0.9667
log.savings    10.353      0.612   16.90  < 2e-16
log.income      5.052      1.258    4.02  6.8e-05
log.address     2.667      0.434    6.14  1.7e-09
log.employed    1.314      0.409    3.21   0.0014

Residual standard error: 10.16 on 495 degrees of freedom
Multiple R-squared: 0.4701, Adjusted R-squared: 0.4658
F-statistic: 109.8 on 4 and 495 DF,  p-value: < 2.2e-16

The first column gives the estimates of each $\beta$ coefficient and the second column gives its "standard error" (i.e., the standard deviation which would be obtained from repeatedly estimating the $\beta$ coefficients on similar data sets). The standard error gives a measure of the uncertainty in the estimated $\beta$ coefficient.

For forecasting purposes, the final two columns are of limited interest. The "t value" is the ratio of an estimated $\beta$ coefficient to its standard error, and the last column gives the p-value: the probability of obtaining an estimated $\beta$ coefficient as large as the one observed if there were no real relationship between the credit score and that predictor. This is useful when studying the effect of each predictor, but it is not particularly useful for forecasting.
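
For example, the full coefficient table can be extracted from the fitted object (assuming the fit object created above), and each t value is simply the estimate divided by its standard error, e.g. $10.353/0.612 \approx 16.9$ for log.savings:

R code
coef(summary(fit))   # estimates, standard errors, t values and p-values
coef(summary(fit))[, "Estimate"] / coef(summary(fit))[, "Std. Error"]   # the t values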

Fitted values, forecast values and residuals

Predictions of $y$ can be calculated by ignoring the error in the regression equation. That is [ \hat{y} = \hat\beta_{0} + \hat\beta_{1} x_{1} + \hat\beta_{2} x_{2} + \cdots + \hat\beta_{k} x_{k}. ] Plugging in values of $x_{1},\dots,x_{k}$ into the right hand side of this equation gives a prediction of $y$ for that combination of predictors.

When this calculation is done using values of the predictors from the data that were used to estimate the model, we call the resulting values of $\hat{y}$ the "fitted values". These are "predictions" of the data used in estimating the model. They are not genuine forecasts as the actual value of $y$ for that set of predictors was used in estimating the model, and so the value of $\hat{y}$ is affected by the true value of $y$.

When the values of $x_{1},\dots,x_{k}$ are new values (i.e., not part of the data that were used to estimate the model), the resulting value of $\hat{y}$ is a genuine forecast.
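
In R, fitted() returns the fitted values for the data used in estimation, while predict() with a newdata argument gives a genuine forecast. The following sketch assumes the fit object from above; the new customer's predictor values are invented purely for illustration.

R code
fitted(fit)[1:5]    # fitted values for the first five customers in the data

# A hypothetical new customer: raw values transformed with log(x+1)
newcust <- data.frame(log.savings=log(5000+1), log.income=log(2500+1),
                      log.address=log(36+1), log.employed=log(24+1))
predict(fit, newdata=newcust)                          # point forecast
predict(fit, newdata=newcust, interval="prediction")   # adds a prediction interval,
                                                       # which relies on the normality
                                                       # assumption noted earlier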

The differences between the observed $y$ values and the fitted values are the "residuals":

$$ e_i = y_i - \hat{y}_i = y_i - \hat\beta_{0} - \hat\beta_{1} x_{1,i} - \hat\beta_{2} x_{2,i} - \cdots - \hat\beta_{k} x_{k,i}. $$

As with simple regression (see Section 4/2), the residuals have zero mean and are uncorrelated with any of the predictors.
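
These properties are easy to verify numerically (again assuming the fit object and creditlog data frame from above):

R code
res <- residuals(fit)
mean(res)                          # essentially zero
cor(res, creditlog$log.savings)    # essentially zero
cor(res, creditlog$log.income)     # essentially zero
cor(res, creditlog$log.address)    # essentially zero
cor(res, creditlog$log.employed)   # essentially zero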

$R^2$: the coefficient of determination

The $R^2$ value was introduced in Section 4/4. It is the square of the correlation between the actual values and the predicted values. The following graph shows the actual values plotted against the fitted values for the credit score data.

Figure 5.4: Actual credit scores plotted against fitted credit scores using the multiple regression model. The correlation is 0.6856, so the squared correlation is $(0.6856)^2 = 0.4701$.

R code
plot(fitted(fit), creditlog$score,
 ylab="Score", xlab="Predicted score")

Recall that the value of $R^2$ can also be calculated as the proportion of variation in the forecast variable that is explained by the regression model:

$$R^2 = \frac{\sum(\hat{y}_{i} - \bar{y})^2}{\sum(y_{i}-\bar{y})^2} $$
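
Both ways of computing $R^2$ give the same answer for a regression with an intercept, as a quick check confirms (assuming fit and creditlog from above):

R code
y    <- creditlog$score
yhat <- fitted(fit)

cor(y, yhat)^2                                   # squared correlation: 0.4701
sum((yhat - mean(y))^2) / sum((y - mean(y))^2)   # proportion of variation explained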

In this case, $R^2=0.47$, so about half of the variation in the scores can be predicted using the model. Thus, the model is not sufficient to replace a more detailed approach to credit scoring, but it might be helpful in filtering out customers who will get a very low score. In the above graph it is evident that anyone with a predicted score below 35 has a true score of not much more than 60. Consequently, if the bank wanted to identify high-scoring customers quickly, this model would provide a useful first filter, and more detailed information would then only need to be collected on a subset of the customers.
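
This filtering idea can be checked directly from the fitted model (again assuming fit and creditlog from above; the cut-off of 35 is simply read off the graph):

R code
# Highest true score among customers with a predicted score below 35
max(creditlog$score[fitted(fit) < 35])

# Number of customers screened out by this first filter
sum(fitted(fit) < 35)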