5.5 Matrix formulation

Warning: this is a more advanced optional section and assumes knowledge of matrix algebra.

The multiple regression model can be written as

$$ y_{i} = \beta_{0} + \beta_{1} x_{1,i} + \beta_{2} x_{2,i} + \cdots + \beta_{k} x_{k,i} + e_{i}. $$

This expresses the relationship between a single value of the forecast variable and the predictors. It can be convenient to write this in matrix form where all the values of the forecast variable are given in a single equation. Let $\bm{Y} = (y_{1},\dots,y_{N})'$, $\bm{e} = (e_{1},\dots,e_{N})'$, $\bm{\beta} = (\beta_{0},\dots,\beta_{k})'$ and

$$ \bm{X} = \left[\begin{matrix} 1 & x_{1,1} & x_{2,1} & \dots & x_{k,1}\\ 1 & x_{1,2} & x_{2,2} & \dots & x_{k,2}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_{1,N} & x_{2,N} & \dots & x_{k,N} \end{matrix}\right]. $$

Then

$$ \bm{Y} = \bm{X}\bm{\beta} + \bm{e}. $$

Least squares estimation

Least squares estimation is obtained by minimizing the expression $(\bm{Y} - \bm{X}\bm{\beta})'(\bm{Y} - \bm{X}\bm{\beta})$. It can be shown that this is minimized when $\bm{\beta}$ takes the value

$$ \hat{\bm{\beta}} = (\bm{X}'\bm{X})^{-1}\bm{X}'\bm{Y}. $$

This is sometimes known as the "normal equation". The estimated coefficients require the inversion of the matrix $\bm{X}'\bm{X}$. If this matrix is singular, then the model cannot be estimated. This will occur, for example, if you fall for the "dummy variable trap" (having the same number of dummy variables as there are categories of a categorical predictor).
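To see the singularity concretely, here is a small NumPy sketch (the data and variable names are illustrative, not from the text): with an intercept column plus one dummy per category, the dummy columns sum to the intercept column, so $\bm{X}'\bm{X}$ loses rank and cannot be inverted.

```python
# Illustrative sketch of the "dummy variable trap"; data is made up.
import numpy as np

N = 6
category = np.array([0, 1, 2, 0, 1, 2])     # a categorical predictor with 3 levels
dummies = np.eye(3)[category]               # one dummy column per category (the trap)

# Intercept column plus all 3 dummies: the dummy columns sum to the intercept,
# so X has 4 columns but only rank 3, and X'X is singular.
X = np.column_stack([np.ones(N), dummies])
rank = np.linalg.matrix_rank(X.T @ X)
```

Dropping one dummy column (so a category is absorbed into the intercept) restores full rank and makes the model estimable.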

The residual variance is estimated using

$$ \hat{\sigma}^2 = \frac{1}{N-k-1}(\bm{Y} - \bm{X}\hat{\bm{\beta}})'(\bm{Y} - \bm{X}\hat{\bm{\beta}}). $$
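A minimal NumPy sketch of both estimates on simulated data (all names and values are illustrative). Note that `np.linalg.solve` is used rather than forming $(\bm{X}'\bm{X})^{-1}$ explicitly, which is the numerically preferable way to apply the normal equation:

```python
# Sketch: least squares estimates via the normal equation; simulated data.
import numpy as np

rng = np.random.default_rng(42)
N, k = 50, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])  # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=N)

# beta-hat = (X'X)^{-1} X'Y, computed by solving X'X b = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# residual variance estimate with N - k - 1 degrees of freedom
resid = Y - X @ beta_hat
sigma2_hat = (resid @ resid) / (N - k - 1)
```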

Fitted values and cross-validation

The normal equation shows that the fitted values can be calculated using

$$ \bm{\hat{Y}} = \bm{X}\hat{\bm{\beta}} = \bm{X}(\bm{X}'\bm{X})^{-1}\bm{X}'\bm{Y} = \bm{H}\bm{Y}, $$

where $\bm{H} = \bm{X}(\bm{X}'\bm{X})^{-1}\bm{X}'$ is known as the "hat-matrix" because it is used to compute $\bm{\hat{Y}}$ ("Y-hat").

If the diagonal values of $\bm{H}$ are denoted by $h_{1},\dots,h_{N}$, then the cross-validation statistic can be computed using

$$ \text{CV} = \frac{1}{N}\sum_{i=1}^N [e_{i}/(1-h_{i})]^2, $$

where $e_{i}$ is the residual obtained from fitting the model to all $N$ observations. Thus, it is not necessary to actually fit $N$ separate models when computing the CV statistic.
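This shortcut can be verified numerically. The sketch below (simulated data, illustrative names) computes CV from the residuals and the diagonal of the hat-matrix, then checks it against the brute-force leave-one-out calculation:

```python
# Sketch: CV from the hat-matrix equals explicit leave-one-out; simulated data.
import numpy as np

rng = np.random.default_rng(1)
N, k = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])
Y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=0.2, size=N)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat-matrix H = X (X'X)^{-1} X'
Y_hat = H @ Y                           # fitted values
h = np.diag(H)                          # diagonal values h_1, ..., h_N
e = Y - Y_hat                           # residuals from the full fit

cv = np.mean((e / (1 - h)) ** 2)        # CV statistic, no refitting required

# Brute-force check: refit with each observation left out in turn
def loo_error(i):
    Xi, Yi = np.delete(X, i, axis=0), np.delete(Y, i)
    beta_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ Yi)
    return (Y[i] - X[i] @ beta_i) ** 2

cv_loo = np.mean([loo_error(i) for i in range(N)])
```

The two values agree exactly (up to floating-point error), which is why the hat-matrix formula makes leave-one-out cross-validation cheap for linear models.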


Forecasts

Let $\bm{X}^*$ be a row vector containing the values of the predictors for the forecasts (in the same format as a row of $\bm{X}$). Then the forecast is given by

$$ \hat{y} = \bm{X}^*\hat{\bm{\beta}} = \bm{X}^*(\bm{X}'\bm{X})^{-1}\bm{X}'\bm{Y} $$

and its estimated variance by

$$ \hat{\sigma}^2 \left[1 + \bm{X}^* (\bm{X}'\bm{X})^{-1} (\bm{X}^*)'\right]. $$

Then a 95% prediction interval can be calculated (assuming normally distributed errors) as

$$ \hat{y} \pm 1.96 \hat{\sigma} \sqrt{1 + \bm{X}^* (\bm{X}'\bm{X})^{-1} (\bm{X}^*)'}. $$
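Putting the pieces together, here is a NumPy sketch of the forecast and its 95% prediction interval under the normality assumption (simulated data; the new predictor row and all names are illustrative):

```python
# Sketch: point forecast and 95% prediction interval; simulated data.
import numpy as np

rng = np.random.default_rng(7)
N, k = 40, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])
Y = X @ np.array([2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - k - 1))

x_star = np.array([1.0, 0.3, -0.8])     # new predictor values, intercept term first
y_hat = x_star @ beta_hat               # point forecast X* beta-hat

# standard error includes both the error term and coefficient uncertainty
se = sigma_hat * np.sqrt(1 + x_star @ XtX_inv @ x_star)
lower, upper = y_hat - 1.96 * se, y_hat + 1.96 * se
```

Note that `se` is always larger than `sigma_hat` alone, reflecting the extra uncertainty in the coefficient estimates; as the text cautions, any uncertainty in `x_star` itself is not accounted for.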

This takes account of the uncertainty due to the error term $e$ and the uncertainty in the coefficient estimates. However, it ignores any errors in $\bm{X}^*$. So if the future values of the predictors are uncertain, then the prediction interval calculated using this expression will be too narrow.