7.3.1 Recursive least-squares

In many data analysis tasks, data are not available all at once but arrive sequentially. Think, for example, of data collected from sensors, financial time series or adaptive control applications. In these cases it is convenient not to restart the model estimation from scratch but simply to update the model on the basis of the newly collected data. One appealing feature of least-squares estimates is that they can be updated at a lower cost than their batch counterpart.

Let us rewrite the least-squares estimator 7.1.11 for a training set of $N$ samples as:

$$ \hat{\beta }_{(N)}=(X_{(N)}^ T X_{(N)})^{-1}X_{(N)}^ T Y_{(N)} $$

where the subscript $(N)$ is added to denote the number of samples used for the estimation. Suppose that a new sample $\langle x_{N+1},y_{N+1} \rangle $ becomes available. Instead of recomputing the estimate $\hat{\beta }_{(N+1)}$ from scratch using all the $N+1$ available data, we want to derive it as an update of $\hat{\beta }_{(N)}$. This problem is solved by the so-called recursive least-squares (RLS) estimation.

If a single new example $\langle x_{N+1},y_{N+1} \rangle $, with $x_{N+1}$ a $[1,p]$ vector, is added to the training set, the matrix $X$ acquires a new row and $\hat{\beta }_{(N+1)}$ can be written as:

$$ \hat{\beta }_{(N+1)}=\left( \left[ \begin{array}{c} X_{(N)} \\ x_{N+1} \end{array} \right]^ T \left[ \begin{array}{c} X_{(N)} \\ x_{N+1} \end{array} \right] \right)^{-1} \left[ \begin{array}{c} X_{(N)}\\ x_{N+1} \end{array} \right]^ T \left[ \begin{array}{c} Y_{(N)} \\ y_{N+1} \end{array} \right] $$

By defining the $[p,p]$ matrix

$$ S_{(N)}=(X_{(N)}^T X_{(N)}) $$

we have

\begin{equation} \begin{split} S_{(N+1)}& =(X_{(N+1)}^T X_{(N+1)})= \left[ \begin{array}{cc} X_{(N)}^T & x^T_{N+1} \end{array} \right] \left[ \begin{array}{c} X_{(N)} \\ x_{N+1} \end{array} \right] \\ & = X^T_{(N)} X_{(N)} +x^T_{N+1} x_{N+1} =S_{(N)}+x^T_{N+1} x_{N+1} \end{split} \end{equation}

Since

$$ \left[ \begin{array}{c} X_{(N)} \\ x_{N+1} \end{array} \right]^ T \left[ \begin{array}{c} Y_{(N)} \\ y_{N+1} \end{array} \right]=X_{(N)}^ T Y_{(N)} + x^ T_{N+1} y_{N+1} $$

and

$$ S_{(N)} \hat{\beta }_{(N)} =(X_{(N)}^ T X_{(N)}) \left[(X_{(N)}^ T X_{(N)})^{-1} X_{(N)}^ T Y_{(N)}\right]=X_{(N)}^ T Y_{(N)} $$

we obtain

\begin{align*} S_{(N+1)} \hat{\beta }_{(N+1)}= \left[ \begin{array}{c} X_{(N)} \\ x_{N+1} \end{array} \right]^ T \left[ \begin{array}{c} Y_{(N)} \\ y_{N+1} \end{array} \right] & = S_{(N)} \hat{\beta }_{(N)}+x^ T_{N+1}y_{N+1} \\ & = \left( S_{(N+1)}-x^ T_{N+1}x_{N+1} \right) \hat{\beta }_{(N)}+x^ T_{N+1}y_{N+1} \\ & = S_{(N+1)}\hat{\beta }_{(N)}-x^ T_{N+1}x_{N+1}\hat{\beta }_{(N)}+x^ T_{N+1}y_{N+1} \end{align*}

or equivalently

\begin{equation} \hat{\beta }_{(N+1)}=\hat{\beta}_{(N)}+S^{-1}_{(N+1)}x^ T_{N+1}(y_{N+1}-x_{N+1}\hat{\beta}_{(N)}) \end{equation}

1st Recursive formulation

From 7.3.4 and 7.3.5 we obtain the following recursive formulation:

$$ \begin{cases} S_{(N+1)}& =S_{(N)}+x^T_{N+1} x_{N+1}\\[3pt] \gamma_{(N+1)}& = S_{(N+1)}^{-1} x^T_{N+1} \\[3pt] e& = y_{N+1}- x_{N+1} \hat{\beta }_{(N)} \\[3pt] \hat{\beta }_{(N+1)}& =\hat{\beta }_{(N)}+ \gamma_{(N+1)} e \\[3pt]\end{cases} $$

where the term $\hat{\beta }_{(N+1)}$ can be expressed as a function of the old estimate $\hat{\beta }_{(N)}$ and the new sample $\langle x_{N+1}, y_{N+1} \rangle $. This formulation requires the inversion of the $[p,p]$ matrix $S_{(N+1)}$. This operation is computationally expensive but, fortunately, using a matrix inversion theorem, an incremental formula for $S^{-1}$ can be found.
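To make the first formulation concrete, the following sketch implements a single update step in R; the function name rls1.step, the argument names and the use of solve() to invert $S_{(N+1)}$ are choices made for this example and are not part of the script lin_rls.R discussed below.

## Sketch of one step of the first recursive formulation (assumed interface):
## x.new is the new regressor (numeric vector of length p), y.new the scalar
## target, beta the current estimate beta_(N), S the current [p,p] matrix S_(N).
rls1.step <- function(x.new, y.new, beta, S){
  S.new <- S + x.new %*% t(x.new)            ## S_(N+1) = S_(N) + x^T x
  gamma <- solve(S.new, x.new)               ## gamma_(N+1) = S_(N+1)^{-1} x^T
  e <- y.new - as.numeric(x.new %*% beta)    ## prediction error on the new sample
  beta.new <- beta + gamma * e               ## beta_(N+1) = beta_(N) + gamma e
  list(beta=beta.new, S=S.new)
}

Each call requires solving a $[p,p]$ linear system, which is precisely the cost that the second formulation below avoids.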

2nd Recursive formulation

Once defined

$$ V_{(N)}=S_{(N)}^{-1}=(X_{(N)}^T X_{(N)})^{-1} $$

we have $(S_{(N+1)})^{-1}=(S_{(N)}+x_{N+1}^ T x_{N+1})^{-1}$ and

\begin{align} V_{(N+1)} & = V_{(N)}-V_{(N)} x^T_{N+1} (I+ x_{N+1} V_{(N)} x^T_{N+1})^{-1} x_{N+1} V_{(N)} \\ & = V_{(N)}-\frac{V_{(N)} x^T_{N+1} x_{N+1} V_{(N)}}{1+ x_{N+1} V_{(N)} x^T_{N+1}} \end{align}
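For the reader's convenience, the matrix inversion result used here is the rank-one (Sherman-Morrison) form of the matrix inversion lemma; it is recalled below as a reminder, since it is not stated explicitly in the derivation. For an invertible $[p,p]$ matrix $A$ and column vectors $u$, $v$,

$$ (A+uv^T)^{-1}=A^{-1}-\frac{A^{-1}u v^T A^{-1}}{1+v^T A^{-1}u}, \qquad 1+v^T A^{-1}u \neq 0 $$

Setting $A=S_{(N)}$ and $u=v=x^T_{N+1}$ gives the expression for $V_{(N+1)}$ above, where $1+x_{N+1} V_{(N)} x^T_{N+1}$ is a scalar, so no matrix inversion is required at all.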

From 7.3.6 and 7.3.5 we obtain a second recursive formulation:

\begin{equation} \begin{cases} V_{(N+1)}& =V_{(N)}-\frac{V_{(N)} x^ T_{N+1} x_{N+1} V_{(N)}}{1+ x_{N+1} V_{(N)} x^ T_{N+1}}\\[3pt] \gamma_{(N+1)}& = V_{(N+1)} x^ T_{N+1} \\[3pt] e& = y_{N+1}- x_{N+1} \hat{\beta }_{(N)} \\[3pt] \hat{\beta }_{(N+1)}& =\hat{\beta }_{(N)}+ \gamma_{(N+1)} e \\[3pt]\end{cases} \end{equation}
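As a quick numerical check (not part of the original text), one step of this second formulation can be compared with the batch least-squares estimate recomputed on the $N+1$ samples; the data and variable names below are invented for the example.

## Sanity check: one step of the second recursive formulation versus the
## batch least-squares estimate on the enlarged training set (toy data).
set.seed(0)
N <- 50; p <- 3
X <- matrix(rnorm(N*p), N, p)                     ## X_(N)
Y <- X %*% c(1, -2, 0.5) + rnorm(N, sd=0.1)       ## Y_(N)
V <- solve(t(X) %*% X)                            ## V_(N)
beta <- V %*% t(X) %*% Y                          ## batch estimate beta_(N)

x.new <- rnorm(p)                                 ## new regressor x_(N+1)
y.new <- sum(x.new * c(1, -2, 0.5)) + rnorm(1, sd=0.1)
V.new <- V - (V %*% x.new %*% t(x.new) %*% V) /
             as.numeric(1 + t(x.new) %*% V %*% x.new)
beta.new <- beta + (V.new %*% x.new) * as.numeric(y.new - x.new %*% beta)

X1 <- rbind(X, x.new); Y1 <- c(Y, y.new)          ## enlarged training set
max(abs(beta.new - solve(t(X1) %*% X1) %*% t(X1) %*% Y1))  ## ~ machine precision

The recursive and batch estimates coincide up to numerical precision, as the derivation above guarantees.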

RLS initialization

Both recursive formulations presented above require the initialization values $\hat{\beta}_{(0)}$ and $V_{(0)}$. One way to avoid choosing these initial values is to collect the first $N$ data points, to compute $\hat{\beta}_{(N)}$ and $V_{(N)}$ directly from

\begin{align*} V_{(N)}=(X_{(N)}^T X_{(N)})^{-1} \\ \hat{\beta}_{(N)}=V_{(N)} X_{(N)}^T Y_{(N)} \end{align*}

and to start the iterations from the $(N+1)$-th point. Otherwise, for a generic initialization $\hat{\beta}_{(0)}$ and $V_{(0)}$, we have the following relations

\begin{align*} V_{(N)}&=(V_{(0)}^{-1} + X_{(N)}^T X_{(N)} )^{-1}\\ \hat{\beta}_{(N)}&=V_{(N)}(X_{(N)}^T Y_{(N)} + V_{(0)}^{-1} \hat{\beta}_{(0)}) \end{align*}

A common choice is to put

\begin{equation*} V_{(0)}= a I, \qquad a>0 \end{equation*}

Since $V_{(0)}$ represents the variance of the estimator, choosing a very large $a$ amounts to considering the initial estimate of $\beta $ as very uncertain. By setting $a$ equal to a large number, the RLS algorithm will move away rapidly from the initialization $\hat{\beta }_{(0)}$. In this way we can force the RLS variance and parameters to be arbitrarily close to the ordinary least-squares values, regardless of $\hat{\beta }_{(0)}$.

In any case, in the absence of further information, the initial value $\hat{\beta }_{(0)}$ is usually set equal to the zero vector.
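The closed-form relations above make it easy to see the effect of the choice $V_{(0)}=aI$, $\hat{\beta}_{(0)}=0$; the short sketch below (with invented toy data) shows the resulting estimate approaching the ordinary least-squares values as $a$ grows.

## Effect of the initialization V_(0) = a*I, beta_(0) = 0 (toy data):
## as a grows, the estimate given by the closed-form relations above
## approaches the ordinary least-squares solution.
set.seed(1)
N <- 100; p <- 2
X <- cbind(1, rnorm(N))                  ## regressors with intercept
Y <- X %*% c(2, -1) + rnorm(N, sd=0.2)
beta0 <- rep(0, p)
for (a in c(1, 100, 1e6)){
  V0 <- a * diag(p)
  V <- solve(solve(V0) + t(X) %*% X)
  beta <- V %*% (t(X) %*% Y + solve(V0) %*% beta0)
  cat("a =", a, " beta =", round(as.numeric(beta), 4), "\n")
}
round(as.numeric(solve(t(X) %*% X) %*% t(X) %*% Y), 4)  ## ordinary least-squares values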

RLS with forgetting factor

In some adaptive configurations it can be useful not to give equal importance to all the historical data but to assign higher weights to the most recent data (and thus to progressively forget the oldest ones). This may happen when the phenomenon underlying the data is nonstationary or when we want to approximate a nonlinear dependence by using a linear model which is local in time. Both situations are common in adaptive control problems.

RLS techniques can deal with these situations through a modification of the formulation 7.3.8, obtained by adding a forgetting factor $\mu < 1$:

$$ \begin{cases} V_{(N+1)}& =\frac{1}{\mu } \left(V_{(N)} -\frac{V_{(N)} x^T_{N+1} x_{N+1} V_{(N)}}{1+ x_{N+1} V_{(N)} x^T_{N+1}} \right)\\[3pt] \gamma_{(N+1)}& = V_{(N+1)} x^T_{N+1} \\[3pt] e& = y_{N+1}- x_{N+1} \hat{\beta }_{(N)} \\[3pt] \hat{\beta }_{(N+1)}& =\hat{\beta }_{(N)}+ \gamma_{(N+1)} e \\[3pt]\end{cases} $$

Note that: (i) the smaller $\mu $, the stronger the forgetting (for instance, $\mu =0.9$ corresponds to an effective memory of roughly $1/(1-\mu )=10$ samples); (ii) for $\mu =1$ we recover the conventional RLS formulation.

R script

The R script lin_rls.R considers the RLS fitting of a nonlinear univariate function. The simulation shows how the fitting evolves as the data $x_i,y_i$, $i=1,\dots ,N$, are collected. Note that the sequence of $x_i$, $i=1,\dots ,N$, is increasing. This means that $\mathbf x$ is not random and that the oldest collected values are the ones with the lowest $x_i$.

The final fitting for a forgetting factor $\mu =0.9$ is shown in Figure 7.6. Note that the linear fitting concerns only the rightmost samples since the values on the left, which are also the oldest ones, are forgotten.

Figure 7.6: RLS fitting of a nonlinear function where the arrival order of data is from left to right

## script lin_rls.R
## RLS fitting of a nonlinear univariate function with a forgetting factor
rm(list=ls())
par(ask=TRUE)   ## wait for a key press between successive plots
n<-1            ## number of input variables

## One step of the recursive least-squares update (formulation 7.3.8, modified
## with the forgetting factor mu): x is the regressor (including the intercept
## term), y the target, t the current parameter estimate, P the current matrix V.
rls<-function(x,y,t,P,mu=1){
    P.new <-(P-(P%*%x%*%x%*%P)/as.numeric(1+x%*%P%*%x))/mu  ## update of V
    ga <- P.new%*%x                                         ## gain gamma
    epsi <- y-x%*%t                                         ## prediction error
    t.new<-t+ga*as.numeric(epsi)                            ## update of beta
    list(t.new,P.new)
}

X<-seq(-pi,pi,by=.02)      ## inputs arrive in increasing order
N<-length(X)
y<-sin(X)+0.1*rnorm(N)     ## noisy nonlinear target

t<-numeric(2)              ## initial estimate beta_(0)=0
P<-500*diag(n+1)           ## initial V_(0)=a*I with a=500
mu<-0.9                    ## forgetting factor

for (i in 1:N){
    rls.step<-rls(c(1, X[i]),y[i],t,P,mu)
    t<-rls.step[[1]]
    P<-rls.step[[2]]
    plot(X[1:i],y[1:i],
         xlim=c(-4,4),
         ylim=c(-2,2),
         main=paste("Forgetting factor mu =",mu))
    lines(X[1:i],cbind(array(1,c(i,1)), X[1:i])%*%t,
          col="red")
}

$\bullet $