7.2 The PRESS statistic

We mentioned in Section 5.7.1 that cross-validation can provide a reliable estimate of the algorithm generalization error $G_N$. The disadvantage of such an approach is that it requires the training process to be repeated $l$ times, which sometimes implies a large computational effort. In the case of linear models, however, there exists a powerful statistical procedure to compute the leave-one-out cross-validation measure at a much reduced computational cost (Fig. 7.5). It is the PRESS (Prediction Sum of Squares) statistic [5], a simple formula which returns the leave-one-out (l-o-o) error as a by-product of the parametric identification of $\hat{\beta }$ in Eq. 8.1.39.

Figure 7.5: Leave-one-out for linear models. The leave-one-out error can be computed in two equivalent ways: the slower way (on the right), which repeats the training and test procedure $N$ times, and the faster way (on the left), which performs the parametric identification only once and then computes the PRESS statistic.

Consider a training set $D_N$ and suppose that, for each $j=1,\dots ,N$:

  1. we set aside the $j^{\text {th}}$ observation $\langle x_j,y_j \rangle $ from the training set,

  2. we use the remaining $N-1$ observations to estimate the linear regression coefficients $\hat{\beta }^{-j}$,

  3. we use $\hat{\beta }^{-j}$ to predict the target in $x_j$.

The leave-one-out residual is

\begin{equation} e_{j}^{\mathrm{loo}}=y_j-\hat{y}^{-j}_{j}=y_j - x_j^ T \hat{\beta }^{-j} \end{equation}
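As a concrete illustration of the procedure above, here is a minimal brute-force sketch in NumPy (hypothetical code, not part of the original text): the regression is refitted $N$ times, once per left-out observation.

```python
import numpy as np

def loo_residuals_bruteforce(X, y):
    """Leave-one-out residuals e_j^loo obtained by refitting the model N times.

    X : (N, p) design matrix (a column of ones may be included for an intercept).
    y : (N,) target vector.
    """
    N = X.shape[0]
    e_loo = np.empty(N)
    for j in range(N):
        keep = np.arange(N) != j                     # set the j-th observation aside
        beta_minus_j, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        e_loo[j] = y[j] - X[j] @ beta_minus_j        # e_j^loo = y_j - x_j^T beta^{-j}
    return e_loo
```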

The PRESS statistic is an efficient way to compute the l-o-o residuals on the basis of the simple regression performed on the whole training set. This allows a fast cross-validation without repeating $N$ times the leave-one-out procedure. The PRESS procedure can be described as follows:

  1. we use the whole training set to estimate the linear regression coefficients $\hat{\beta }$. This procedure is performed only once on the $N$ samples and returns as a by-product the Hat matrix (see Section 7.1.13)

    \begin{equation} H=X(X^ TX)^{-1}X^ T \end{equation}

  2. we compute the residual vector $e$, whose $j^\mathrm {th}$ term is $e_j=y_j-x_j^ T \hat{\beta }$,

  3. we use the PRESS statistic to compute $e^{\mathrm{loo}}_j$ as

\begin{equation} e^{\mathrm{loo}}_j=\frac{e_j}{1-H_{jj}} \end{equation}

where $H_{jj}$ is the $j^\mathrm {th}$ diagonal term of the matrix $H$.

Note that Eq. 7.2.3 is not an approximation of Eq. 7.2.1 but simply a faster way of computing the leave-one-out residual $e^{\mathrm{loo}}_j$.
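For comparison, here is a sketch of the PRESS route in the same hypothetical NumPy setting as above: a single fit on all $N$ samples, followed by a rescaling of the ordinary residuals by $1-H_{jj}$.

```python
import numpy as np

def loo_residuals_press(X, y):
    """Leave-one-out residuals via the PRESS statistic: e_j / (1 - H_jj)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # single fit on all N samples
    e = y - X @ beta_hat                              # ordinary residuals e_j
    # Diagonal of H = X (X^T X)^{-1} X^T, computed without forming the full N x N matrix.
    h = np.einsum('ij,ij->i', X @ np.linalg.inv(X.T @ X), X)
    return e / (1 - h)
```

On any data set with a well-conditioned $X^T X$, this function returns the same residuals (up to floating-point error) as the brute-force sketch above, consistent with the exactness of Eq. 7.2.3.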

Let us now derive the formula of the PRESS statistic.

Matrix manipulations show that

\begin{equation} X^ T X-x_j x_j^ T= X_{-j}^ T X_{-j} \end{equation}

where $X_{-j}$ is the matrix $X$ with the $j^{\text {th}}$ row set aside.

Using the relation B.5.1 (the Sherman-Morrison identity for a rank-one modification) and recalling that $H_{jj}=x_j^T (X^T X)^{-1} x_j$, we have

\begin{align*} (X_{-j}^ T X_{-j})^{-1} &=(X^ T X-x_j x_j^T)^{-1} \\ &=(X^ T X)^{-1}+ \frac{(X^ T X)^{-1} x_j x_j^ T (X^ T X)^{-1}}{1-H_{jj}} \end{align*}

and

\begin{align} \hat{\beta }^{-j} &= (X_{-j}^T X_{-j})^{-1} X_{-j}^T y_{-j} \\ &= \left[(X^T X)^{-1}+\frac{(X^T X)^{-1} x_j x_j^T (X^T X)^{-1}}{1-H_{jj}}\right] X^T_{-j}y_{-j} \end{align}

where $y_{-j}$ is the target vector with the $j^{\text {th}}$ sample set aside.
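The rank-one inverse update used in the first of these relations can also be checked numerically; a small illustrative sketch (arbitrary synthetic data, not from the text):

```python
import numpy as np

# Check that (X^T X - x_j x_j^T)^{-1} matches the Sherman-Morrison form used above.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # arbitrary 20 x 3 design matrix
j = 5
x_j = X[j]
A_inv = np.linalg.inv(X.T @ X)
H_jj = x_j @ A_inv @ x_j
lhs = np.linalg.inv(X.T @ X - np.outer(x_j, x_j))               # (X_{-j}^T X_{-j})^{-1}
rhs = A_inv + (A_inv @ np.outer(x_j, x_j) @ A_inv) / (1 - H_jj)
assert np.allclose(lhs, rhs)
```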

From Eqs. 7.2.1 and 7.2.6 we have

\begin{equation} \begin{split} e_{j}^{\mathrm{loo}}& =y_j- x_j^ T \hat{\beta }^{-j} \\ & =y_j-x_j^ T \left[ { (X^ T X)^{-1}}+ \frac{ { (X^ T X)^{-1}} x_j x_j^ T { (X^ T X)^{-1}}}{1-H_{jj}} \right] X_{-j}^ T y_{-j} \\ & =y_j- x_j^ T { (X^ T X)^{-1}} X_{-j}^ T y_{-j} -\frac{ H_{jj} x_j^ T { (X^ TX)^{-1}} X_{-j}^ T y_{-j} }{1-H_{jj}} \\ & =\frac{ (1-H_{jj}) y_j - x_j^ T { (X^ TX)^{-1}} X_{-j}^ T { y_{-j}}}{1-H_{jj}} \\ & =\frac{ (1-H_{jj})y_j - x_j^ T { (X^ TX)^{-1}} ( X^ T y - x_j y_j)}{1-H_{jj}} \\ & =\frac{(1-H_{jj})y_j-\hat{y}_j+H_{jj}y_j}{1-H_{jj}} \\ & =\frac{y_j-\hat{y}_j }{1-H_{jj}}=\frac{e_j}{1-H_{jj}} \end{split} \end{equation}

where we used the identities $X_{-j}^T y_{-j}+x_j y_j=X^T y$ and $x_j^T (X^T X)^{-1} X^T y=\hat{y}_j$.

Thus, the leave-one-out estimate of the local mean integrated squared error is:

\begin{equation} \hat{G}_{\mathrm{loo}}=\frac{1}{N}\sum_{i=1}^{N} \left\lbrace \frac{y_i-\hat{y}_i}{1-H_{ii}} \right\rbrace ^2 \end{equation}
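As a usage sketch (hypothetical NumPy data, not from the text), $\hat{G}_{\mathrm{loo}}$ is then obtained in a single pass from the ordinary residuals and the diagonal of the hat matrix:

```python
import numpy as np

# Illustrative data: N = 50 samples, an intercept column plus 3 inputs (arbitrary choice).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=50)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat                                      # ordinary residuals e_i
h = np.einsum('ij,ij->i', X @ np.linalg.inv(X.T @ X), X)  # diagonal terms H_ii
G_loo = np.mean((e / (1 - h)) ** 2)                       # leave-one-out estimate above
```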