# 7.7 Innovations state space models for exponential smoothing

In the rest of this chapter we study statistical models that underlie the exponential smoothing methods we have considered so far. The exponential smoothing methods presented in Table 7.8 are algorithms that generate point forecasts. The statistical models in this section generate the same point forecasts, but can also generate prediction (or forecast) intervals. A statistical model is a stochastic (or random) data generating process that can produce an entire forecast distribution. The general statistical framework we will introduce also provides a platform for using the model selection criteria introduced in Chapter 5, thus allowing the choice of model to be made in an objective manner.

Each model consists of a measurement equation that describes the observed data and some transition equations that describe how the unobserved components or states (level, trend, seasonal) change over time. Hence these are referred to as “state space models”.

For each method there exist two models: one with additive errors and one with multiplicative errors. The point forecasts produced by the models are identical if they use the same smoothing parameter values. They will, however, generate different prediction intervals.

To distinguish between a model with additive errors and one with multiplicative errors (and also to distinguish the models from the methods), we add a third letter to the classification of Table 7.7. We label each state space model as ETS($\cdot,\cdot,\cdot$) for (Error, Trend, Seasonal). This label can also be thought of as ExponenTial Smoothing. Using the same notation as in Table 7.7, the possibilities for each component are: Error $=\{\text{A},\text{M}\}$, Trend $=\{\text{N},\text{A},\text{A}_d,\text{M},\text{M}_d\}$ and Seasonal $=\{\text{N},\text{A},\text{M}\}$. Therefore, in total there exist 30 such state space models: 15 with additive errors and 15 with multiplicative errors.

## ETS(A,N,N): simple exponential smoothing with additive errors

As discussed in Section 7.1, the error correction form of simple exponential smoothing is given by $$\ell_t=\ell_{t-1}+\alpha e_t, $$ where $e_t = y_t - \ell_{t-1}$ and $\hat{y}_{t|t-1} = \ell_{t-1}$. Thus, $e_t = y_t - \hat{y}_{t|t-1}$ represents a one-step forecast error and we can write $y_t = \ell_{t-1} + e_t$.

To make this into an innovations state space model, all we need to do is specify the probability distribution for $e_t$. For a model with additive errors, we assume that one-step forecast errors $e_t$ are normally distributed white noise with mean 0 and variance $\sigma^2$. A short-hand notation for this is $e_t = \varepsilon_t\sim\text{NID}(0,\sigma^2)$; NID stands for “normally and independently distributed”.

Then the equations of the model can be written as
\begin{equation} y_t = \ell_{t-1} + \varepsilon_t \tag{7.4}\label{eq-ann-1a} \end{equation}
\begin{equation} \ell_t=\ell_{t-1}+\alpha \varepsilon_t. \tag{7.5}\label{eq-ann-2a} \end{equation}
We refer to (7.4) as the *measurement* (or observation) equation and (7.5) as the *state* (or transition) equation. These two equations, together with the statistical distribution of the errors, form a fully specified statistical model. Specifically, these constitute an innovations state space model underlying simple exponential smoothing.

The term “innovations” comes from the fact that all equations in this type of specification use the same random error process, $\varepsilon_t$. For the same reason this formulation is also referred to as a “single source of error” model in contrast to alternative multiple source of error formulations, which we do not present here.

The measurement equation shows the relationship between the observations and the unobserved states. In this case observation $y_t$ is a linear function of the level $\ell_{t-1}$, the predictable part of $y_t$, and the random error $\varepsilon_t$, the unpredictable part of $y_t$. For other innovations state space models, this relationship may be nonlinear.

The transition equation shows the evolution of the state through time. The influence of the smoothing parameter $\alpha$ is the same as for the methods discussed earlier. For example $\alpha$ governs the degree of change in successive levels. The higher the value of $\alpha$, the more rapid the changes in the level; the lower the value of $\alpha$, the smoother the changes. At the lowest extreme, where $\alpha=0$, the level of the series does not change over time. At the other extreme, where $\alpha=1$, the model reduces to a random walk model, $y_t=y_{t-1}+\varepsilon_t$.
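
To make the data generating process concrete, the following sketch simulates a series from the ETS(A,N,N) model by iterating equations (7.4) and (7.5); the values of $\alpha$, $\ell_0$ and $\sigma$ are illustrative choices, not estimates from any data set.

```r
# A minimal sketch: simulating from ETS(A,N,N) via equations (7.4)-(7.5).
# alpha, l0 and sigma are illustrative values.
set.seed(1)
n <- 100; alpha <- 0.3; l0 <- 100; sigma <- 5
eps <- rnorm(n, mean = 0, sd = sigma)  # NID(0, sigma^2) innovations
y <- numeric(n)
level <- l0
for (t in 1:n) {
  y[t]  <- level + eps[t]          # measurement equation (7.4)
  level <- level + alpha * eps[t]  # transition equation (7.5)
}
plot(ts(y))
```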

## ETS(M,N,N): simple exponential smoothing with multiplicative errors

In a similar fashion, we can specify models with multiplicative errors by writing the one-step random errors as relative errors: $$\varepsilon_t = \frac{y_t-\hat{y}_{t|t-1}}{\hat{y}_{t|t-1}}, $$ where $\varepsilon_t \sim \text{NID}(0,\sigma^2)$. Substituting $\hat{y}_{t|t-1}=\ell_{t-1}$ gives $y_t = \ell_{t-1}+\ell_{t-1}\varepsilon_t$ and $e_t = y_t - \hat{y}_{t|t-1} = \ell_{t-1}\varepsilon_t$.

Then we can write the multiplicative form of the state space model as
\begin{align*}
y_t &= \ell_{t-1}(1+\varepsilon_t) \\
\ell_t &= \ell_{t-1}(1+\alpha\varepsilon_t).
\end{align*}
## ETS(A,A,N): Holt’s linear method with additive errors

For this model, we assume that one-step forecast errors are given by $\varepsilon_t=y_t-\ell_{t-1}-b_{t-1} \sim \text{NID}(0,\sigma^2)$. Substituting this into the error correction equations for Holt’s linear method we obtain
\begin{align*}
y_t &= \ell_{t-1} + b_{t-1} + \varepsilon_t \\
\ell_t &= \ell_{t-1} + b_{t-1} + \alpha\varepsilon_t \\
b_t &= b_{t-1} + \beta\varepsilon_t,
\end{align*}
where for simplicity we have set $\beta=\alpha \beta^*$.

## ETS(M,A,N): Holt’s linear method with multiplicative errors

Specifying one-step forecast errors as relative errors such that $$\varepsilon_t=\frac{y_t-(\ell_{t-1}+b_{t-1})}{\ell_{t-1}+b_{t-1}}, $$ and following a similar approach as above, the innovations state space model underlying Holt’s linear method with multiplicative errors is specified as
\begin{align*}
y_t &= (\ell_{t-1}+b_{t-1})(1+\varepsilon_t) \\
\ell_t &= (\ell_{t-1}+b_{t-1})(1+\alpha\varepsilon_t) \\
b_t &= b_{t-1} + \beta(\ell_{t-1}+b_{t-1})\varepsilon_t,
\end{align*}
where again $\beta=\alpha \beta^*$ and $\varepsilon_t \sim \text{NID}(0,\sigma^2)$.

## Other ETS models

In a similar fashion, we can write an innovations state space model for each of the exponential smoothing methods of Table 7.8. Table 7.10 presents the equations for all of the models in the ETS framework.

## Estimating ETS models

The smoothing parameters and the initial states of an ETS model are estimated by maximizing the “likelihood” — the probability of the data arising from the specified model. For models with additive errors, this gives the same parameter estimates as minimizing the sum of squared errors. Traditionally, the parameters have been constrained to lie between 0 and 1 so that the smoothing equations can be interpreted as weighted averages; we refer to this as the *usual* parameter region.

Another way to view the parameters is through a consideration of the mathematical properties of the state space models. Then the parameters are constrained to prevent observations in the distant past from having a continuing effect on current forecasts. This leads to some *admissibility* constraints on the parameters, which are usually (but not always) less restrictive than the usual region. For example, for the ETS(A,N,N) model the usual parameter region is $0< \alpha <1$ but the admissible region is $0< \alpha <2$. For the ETS(A,A,N) model, the usual parameter region is $0<\alpha<1$ and $0<\beta<\alpha$, but the admissible region is $0<\alpha<2$ and $0<\beta<4-2\alpha$.
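
In R, the parameter region used during estimation is controlled by the `bounds` argument of the `ets()` function described later in this section. A minimal sketch, assuming the `oildata` series from Example 7.1 is available (it is provided by the fpp package):

```r
library(fpp)  # provides oildata and loads the forecast package
# Fit ETS(A,N,N) under the usual region (0 < alpha < 1) and under the
# wider admissible region (0 < alpha < 2), then compare the estimates.
fit_usual <- ets(oildata, model = "ANN", bounds = "usual")
fit_admis <- ets(oildata, model = "ANN", bounds = "admissible")
fit_usual$par
fit_admis$par
```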

## Model selection

A great advantage of the ETS statistical framework is that information criteria can be used for model selection. The AIC, AIC$_{\text{c}}$ and BIC, introduced in Section 5.3, can be used here to determine which of the 30 ETS models is most appropriate for a given time series.

For ETS models, Akaike’s Information Criterion (AIC) is defined as $$\text{AIC} = -2\log(L) + 2k, $$ where $L$ is the likelihood of the model and $k$ is the total number of parameters and initial states that have been estimated (including the residual variance). The AIC corrected for small sample bias (AIC$_{\text{c}}$) is defined as $$\text{AIC}_{\text{c}} = \text{AIC} + \frac{2k(k+1)}{T-k-1}, $$ and the Bayesian Information Criterion (BIC) is $$\text{BIC} = \text{AIC} + k[\log(T)-2]. $$
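
To make the arithmetic concrete, here is a minimal sketch implementing these three formulas, where `logL` is the maximized log-likelihood, `k` the number of estimated parameters and initial states, and `n` the number of observations ($T$ in the text):

```r
# Illustrative implementations of the information criteria above.
aic  <- function(logL, k)    -2 * logL + 2 * k
aicc <- function(logL, k, n) aic(logL, k) + 2 * k * (k + 1) / (n - k - 1)
bic  <- function(logL, k, n) aic(logL, k) + k * (log(n) - 2)
```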

Some of the combinations of (Error, Trend, Seasonal) can lead to numerical difficulties. Specifically, the models that can cause such instabilities are: ETS(M,M,A), ETS(M,M$_d$,A), ETS(A,N,M), ETS(A,A,M), ETS(A,A$_d$,M), ETS(A,M,N), ETS(A,M,A), ETS(A,M,M), ETS(A,M$_d$,N), ETS(A,M$_d$,A), and ETS(A,M$_d$,M). We normally do not consider these particular combinations when selecting a model.

Models with multiplicative errors are useful when the data are strictly positive, but are not numerically stable when the data contain zeros or negative values. Therefore multiplicative error models will not be considered if the time series is not strictly positive. In that case, only the six fully additive models will be applied.

## The ets() function in R

The models can be estimated in R using the `ets()` function in the forecast package. The R code below shows all the possible arguments this function takes, and their default values. If only the time series is specified, and all other arguments are left at their default values, then an appropriate model will be selected automatically. We explain each of the arguments below.

```r
ets(y, model="ZZZ", damped=NULL, alpha=NULL, beta=NULL,
    gamma=NULL, phi=NULL, additive.only=FALSE, lambda=NULL,
    lower=c(rep(0.0001,3), 0.8), upper=c(rep(0.9999,3),0.98),
    opt.crit=c("lik","amse","mse","sigma","mae"), nmse=3,
    bounds=c("both","usual","admissible"),
    ic=c("aicc","aic","bic"), restrict=TRUE)
```

- `y`: The time series to be forecast.
- `model`: A three-letter code indicating the model to be estimated using the ETS classification and notation. The possible inputs are “N” for none, “A” for additive, “M” for multiplicative, or “Z” for automatic selection. If any of the inputs is left as “Z”, then this component is selected according to the information criterion chosen. The default value of `ZZZ` ensures that all components are selected using the information criterion.
- `damped`: If `damped=TRUE`, then a damped trend will be used (either A$_d$ or M$_d$). If `damped=FALSE`, then a non-damped trend will be used. If `damped=NULL` (the default), then either a damped or a non-damped trend will be selected according to the information criterion chosen.
- `alpha, beta, gamma, phi`: The values of the smoothing parameters can be specified using these arguments. If they are set to `NULL` (the default setting for each of them), the parameters are estimated.
- `additive.only`: Only models with additive components will be considered if `additive.only=TRUE`. Otherwise, all models will be considered.
- `lambda`: Box-Cox transformation parameter. It will be ignored if `lambda=NULL` (the default value). Otherwise, the time series will be transformed before the model is estimated. When `lambda` is not `NULL`, `additive.only` is set to `TRUE`.
- `lower, upper`: Lower and upper bounds for the parameter estimates $\alpha$, $\beta^*$, $\gamma^*$ and $\phi$.
- `opt.crit`: The optimization criterion to be used for estimation. The default setting is maximum likelihood estimation, used when `opt.crit="lik"`.
- `bounds`: This specifies the constraints to be used on the parameters. The traditional constraints are set using `bounds="usual"` and the admissible constraints are set using `bounds="admissible"`. The default (`bounds="both"`) requires the parameters to satisfy both sets of constraints.
- `ic`: The information criterion to be used in selecting models, set by default to `aicc`.
- `restrict`: If `restrict=TRUE` (the default), the models that cause numerical difficulties are not considered in model selection.
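
As a usage sketch combining several of these arguments (with `y` standing for any univariate time series object):

```r
library(forecast)
# Consider only additive-error models ("A" fixed for the error term),
# let the trend and seasonal components be chosen automatically ("Z"),
# and select among the candidates by minimizing the AICc.
fit <- ets(y, model = "AZZ", ic = "aicc", restrict = TRUE)
summary(fit)
```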

## Forecasting with ETS models

Point forecasts are obtained from the models by iterating the equations for $t=T+1,T+2,\dots,T+h$ and setting all $\varepsilon_t=0$ for $t>T$. For example, for model ETS(M,A,N), $$y_{T+1} = (\ell_T + b_T )(1+ \varepsilon_{T+1}).$$ Therefore $$\hat{y}_{T+1|T} = \ell_T + b_T.$$ Similarly, $$y_{T+2} = (\ell_{T+1} + b_{T+1})(1+\varepsilon_{T+2}) = \left[(\ell_T + b_T)(1+\alpha\varepsilon_{T+1}) + b_T + \beta(\ell_T+b_T)\varepsilon_{T+1}\right](1+\varepsilon_{T+2}).$$ Therefore, $$\hat{y}_{T+2|T} = \ell_T + 2b_T,$$ and so on. These forecasts are identical to the forecasts from Holt’s linear method and also those from model ETS(A,A,N). So the point forecasts obtained from the method and from the two models that underlie the method are identical (assuming the same parameter values are used).
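
To check this numerically, the following sketch iterates the ETS(M,A,N) equations forward with all future errors set to zero, using illustrative values of $\ell_T$ and $b_T$; the point forecasts reduce to $\ell_T + h b_T$, the same linear forecast function as Holt’s method.

```r
# Iterating the ETS(M,A,N) equations with eps = 0 beyond time T;
# l_T = 100 and b_T = 2 are illustrative values.
level <- 100; trend <- 2
yhat <- numeric(10)
for (h in 1:10) {
  yhat[h] <- level + trend  # y_{T+h} = (l + b)(1 + eps) with eps = 0
  level <- level + trend    # l update: (l + b)(1 + alpha*eps) = l + b
  # trend update: b + beta*(l + b)*eps = b, so it is unchanged
}
all.equal(yhat, 100 + (1:10) * 2)  # same as l_T + h * b_T
```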

A big advantage of the models is that prediction intervals can also be generated — something that cannot be done using the methods. The prediction intervals will differ between models with additive and multiplicative errors.

For some models, there are exact formulae that enable prediction intervals to be calculated. A more general approach that works for all models is to simulate future sample paths, conditional on the last estimate of the states, and to obtain prediction intervals from the percentiles of these simulated future paths. These options are available in R using the `forecast` function in the forecast package. The R code below shows all the possible arguments this function takes when applied to an ETS model. We explain each of the arguments in what follows.

```r
forecast(object, h=ifelse(object$m>1, 2*object$m, 10),
    level=c(80,95), fan=FALSE, simulate=FALSE, bootstrap=FALSE,
    npaths=5000, PI=TRUE, lambda=object$lambda, ...)
```

- `object`: The object returned by the `ets()` function.
- `h`: The forecast horizon, i.e., the number of periods to be forecast.
- `level`: The confidence level for the prediction intervals.
- `fan`: If `fan=TRUE`, then `level=seq(50,99,by=1)`. This is suitable for fan plots.
- `simulate`: If `simulate=TRUE`, prediction intervals are produced by simulation rather than using algebraic formulae. Simulation will be used even if `simulate=FALSE` if there are no algebraic formulae available for the particular model.
- `bootstrap`: If `bootstrap=TRUE` and `simulate=TRUE`, then the simulated prediction intervals use re-sampled errors rather than normally distributed errors.
- `npaths`: The number of sample paths used in computing simulated prediction intervals.
- `PI`: If `PI=TRUE`, then prediction intervals are produced; otherwise only point forecasts are calculated. If `PI=FALSE`, then `level`, `fan`, `simulate`, `bootstrap` and `npaths` are all ignored.
- `lambda`: The Box-Cox transformation parameter. This is ignored if `lambda=NULL`. Otherwise, forecasts are back-transformed via an inverse Box-Cox transformation.
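
As a usage sketch (with `fit` standing for an object returned by `ets()`), simulated prediction intervals with bootstrapped errors can be requested as follows:

```r
# 80% and 95% prediction intervals computed from 5000 simulated sample
# paths, re-sampling the residuals instead of assuming normal errors.
fc <- forecast(fit, h = 8, level = c(80, 95),
               simulate = TRUE, bootstrap = TRUE, npaths = 5000)
plot(fc)
```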

## Example 7.1 Oil production example (revisited)

Figure 7.8 shows the point forecasts and prediction intervals from an estimated ETS(A,N,N) model. The estimates for the smoothing parameter $\alpha$ and the initial level $\ell_0$ are 0.89 and 447.49 respectively, identical to the estimates for the simple exponential smoothing method estimated earlier (see beginning of Example 7.1). The plot shows the importance of generating prediction intervals. The intervals here are relatively wide, so interpreting the point forecasts without accounting for the large uncertainty can be very misleading.

```r
fit <- ets(oildata, model="ANN")
plot(forecast(fit, h=3), ylab="Oil (millions of tonnes)")
fit$par
```

```
  alpha       l
  0.892 447.489
```

## Example 7.4 International tourist visitor nights in Australia (revisited)

We now employ the ETS statistical framework to forecast tourist visitor nights in Australia by international arrivals over the period 2011–2012. We let the `ets()` function select the model by minimizing the AICc.

```r
fit <- ets(vndata)
summary(fit)
```

```
Call:
 ets(y = vndata)

  Smoothing parameters:
    alpha = 0.4504
    beta  = 4e-04
    gamma = 0.0046

  Initial states:
    l = 32.4349
    b = 0.6533
    s = 1.0275 0.9463 0.7613 1.2648

  sigma:  0.0332

     AIC    AICc     BIC
 105.972 115.572 115.397

Training set error measures:
                    ME    RMSE    MAE       MPE    MAPE     MASE      ACF1
Training set -0.081646 1.33328 1.0647 -0.221243 2.65469 0.374361 -0.089119
```

The model selected is ETS(M,A,M):
\begin{align*}
y_t &= (\ell_{t-1}+b_{t-1})s_{t-m}(1+\varepsilon_t) \\
\ell_t &= (\ell_{t-1}+b_{t-1})(1+\alpha\varepsilon_t) \\
b_t &= b_{t-1}+\beta(\ell_{t-1}+b_{t-1})\varepsilon_t \\
s_t &= s_{t-m}(1+\gamma\varepsilon_t),
\end{align*}
where $m=4$ for these quarterly data.

The parameter estimates are $\alpha=0.4504$, $\beta=0.0004$, and $\gamma=0.0046$. The output also returns the estimates for the initial states $\ell_0$, $b_0$, $s_{0}$, $s_{-1}$, $s_{-2}$ and $s_{-3}$. The within-sample RMSE for this model is slightly lower than that of the Holt-Winters methods with additive and multiplicative seasonality presented in Tables 7.5 and 7.6 respectively. Figure 7.9 shows the states over time, while Figure 7.10 shows point forecasts and 95% and 80% prediction intervals generated from the model. The intervals this time are much narrower than the prediction intervals in the oil production example.

```r
plot(forecast(fit, h=8),
     ylab="International visitor night in Australia (millions)")
```