7.7 Innovations state space models for exponential smoothing

In the rest of this chapter we study statistical models that underlie the exponential smoothing methods we have considered so far. The exponential smoothing methods presented in Table 7.8 are algorithms that generate point forecasts. The statistical models in this section generate the same point forecasts, but can also generate prediction (or forecast) intervals. A statistical model is a stochastic (or random) data generating process that can produce an entire forecast distribution. The general statistical framework we will introduce also provides a platform for using the model selection criteria introduced in Chapter 5, thus allowing the choice of model to be made in an objective manner.

Each model consists of a measurement equation that describes the observed data and some transition equations that describe how the unobserved components or states (level, trend, seasonal) change over time. Hence these are referred to as “state space models”.

For each method there exist two models: one with additive errors and one with multiplicative errors. The point forecasts produced by the models are identical if they use the same smoothing parameter values. They will, however, generate different prediction intervals.

To distinguish between a model with additive errors and one with multiplicative errors (and to also distinguish the models from the methods) we add a third letter to the classification of Table 7.7. We label each state space model as ETS($\cdot,\cdot,\cdot$) for (Error, Trend, Seasonal). This label can also be thought of as ExponenTial Smoothing. Using the same notation as in Table 7.7, the possibilities for each component are: Error $=\{$A,M$\}$, Trend $=\{$N,A,Ad,M,Md$\}$ and Seasonal $=\{$N,A,M$\}$. Therefore, in total there exist 30 such state space models: 15 with additive errors and 15 with multiplicative errors.

ETS(A,N,N): simple exponential smoothing with additive errors

As discussed in Section 7.1, the error correction form of simple exponential smoothing is given by $$\ell_t=\ell_{t-1}+\alpha e_t, $$ where $e_t = y_t - \ell_{t-1}$ and $\pred{y}{t}{t-1} = \ell_{t-1}$. Thus, $e_t = y_t - \pred{y}{t}{t-1}$ represents a one-step forecast error and we can write $y_t = \ell_{t-1} + e_t$.

To make this into an innovations state space model, all we need to do is specify the probability distribution for $e_t$. For a model with additive errors, we assume that one-step forecast errors $e_t$ are normally distributed white noise with mean 0 and variance $\sigma^2$. A short-hand notation for this is $e_t = \varepsilon_t\sim\text{NID}(0,\sigma^2)$; NID stands for “normally and independently distributed”.

Then the equations of the model can be written \begin{equation} y_t = \ell_{t-1} + \varepsilon_t \tag{7.4}\label{eq-ann-1a} \end{equation} \begin{equation} \ell_t=\ell_{t-1}+\alpha \varepsilon_t. \tag{7.5}\label{eq-ann-2a} \end{equation} We refer to (7.4) as the measurement (or observation) equation and (7.5) as the state (or transition) equation. These two equations, together with the statistical distribution of the errors, form a fully specified statistical model. Specifically, these constitute an innovations state space model underlying simple exponential smoothing.

The term “innovations” comes from the fact that all equations in this type of specification use the same random error process, $\varepsilon_t$. For the same reason this formulation is also referred to as a “single source of error” model in contrast to alternative multiple source of error formulations, which we do not present here.

The measurement equation shows the relationship between the observations and the unobserved states. In this case observation $y_t$ is a linear function of the level $\ell_{t-1}$, the predictable part of $y_t$, and the random error $\varepsilon_t$, the unpredictable part of $y_t$. For other innovations state space models, this relationship may be nonlinear.

The transition equation shows the evolution of the state through time. The influence of the smoothing parameter $\alpha$ is the same as for the methods discussed earlier. For example $\alpha$ governs the degree of change in successive levels. The higher the value of $\alpha$, the more rapid the changes in the level; the lower the value of $\alpha$, the smoother the changes. At the lowest extreme, where $\alpha=0$, the level of the series does not change over time. At the other extreme, where $\alpha=1$, the model reduces to a random walk model, $y_t=y_{t-1}+\varepsilon_t$.
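
To emphasise that the model is a data generating process, the following sketch (an illustration, not from the text) simulates observations from equations (7.4) and (7.5) for two values of $\alpha$; the initial level $\ell_0=100$ and error standard deviation $\sigma=5$ are arbitrary choices.

R code
# A minimal sketch simulating an ETS(A,N,N) process; l0 and sigma are
# arbitrary illustrative values.
simulate_ann <- function(n, alpha, l0=100, sigma=5) {
  y <- numeric(n)
  l <- l0
  for (t in 1:n) {
    e <- rnorm(1, 0, sigma)  # epsilon_t ~ NID(0, sigma^2)
    y[t] <- l + e            # measurement equation (7.4)
    l <- l + alpha*e         # transition equation (7.5)
  }
  ts(y)
}
set.seed(1)
plot(simulate_ann(100, alpha=0.1), ylab="y")    # slowly changing level
lines(simulate_ann(100, alpha=0.9), col="red")  # rapidly changing level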

ETS(M,N,N): simple exponential smoothing with multiplicative errors

In a similar fashion, we can specify models with multiplicative errors by writing the one-step random errors as relative errors:

\begin{align*} \varepsilon_t&=\frac{y_t-\pred{y}{t}{t-1}}{\pred{y}{t}{t-1}} \end{align*}

where $\varepsilon_t \sim \text{NID}(0,\sigma^2)$. Substituting $\pred{y}{t}{t-1}=\ell_{t-1}$ gives $y_t = \ell_{t-1}+\ell_{t-1}\varepsilon_t$ and $e_t = y_t - \pred{y}{t}{t-1} = \ell_{t-1}\varepsilon_t$.

Then we can write the multiplicative form of the state space model as

\begin{align*} y_t&=\ell_{t-1}(1+\varepsilon_t)\\ \ell_t&=\ell_{t-1}(1+\alpha \varepsilon_t). \end{align*}
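
The contrast with the additive error model can be seen by simulation. In the sketch below (again illustrative, with arbitrary $\ell_0$ and $\sigma$), the errors are relative, so the variability of the series grows and shrinks with its level.

R code
# A minimal sketch simulating an ETS(M,N,N) process; here sigma is the
# standard deviation of the relative errors.
simulate_mnn <- function(n, alpha, l0=100, sigma=0.05) {
  y <- numeric(n)
  l <- l0
  for (t in 1:n) {
    e <- rnorm(1, 0, sigma)  # epsilon_t ~ NID(0, sigma^2)
    y[t] <- l*(1 + e)        # measurement equation
    l <- l*(1 + alpha*e)     # transition equation
  }
  ts(y)
}
set.seed(1)
plot(simulate_mnn(100, alpha=0.5), ylab="y")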

ETS(A,A,N): Holt’s linear method with additive errors

For this model, we assume that one-step forecast errors are given by $\varepsilon_t=y_t-\ell_{t-1}-b_{t-1} \sim \text{NID}(0,\sigma^2)$. Substituting this into the error correction equations for Holt’s linear method we obtain

\begin{align*} y_t&=\ell_{t-1}+b_{t-1}+\varepsilon_t\\ \ell_t&=\ell_{t-1}+b_{t-1}+\alpha \varepsilon_t\\ b_t&=b_{t-1}+\beta \varepsilon_t, \end{align*}

where for simplicity we have set $\beta=\alpha \beta^*$.

ETS(M,A,N): Holt’s linear method with multiplicative errors

Specifying one-step forecast errors as relative errors such that $$\varepsilon_t=\frac{y_t-(\ell_{t-1}+b_{t-1})}{(\ell_{t-1}+b_{t-1})} $$ and following a similar approach as above, the innovations state space model underlying Holt’s linear method with multiplicative errors is specified as

\begin{align*} y_t&=(\ell_{t-1}+b_{t-1})(1+\varepsilon_t)\\ \ell_t&=(\ell_{t-1}+b_{t-1})(1+\alpha \varepsilon_t)\\ b_t&=b_{t-1}+\beta(\ell_{t-1}+b_{t-1}) \varepsilon_t \end{align*}

where again $\beta=\alpha \beta^*$ and $\varepsilon_t \sim \text{NID}(0,\sigma^2)$.

Other ETS models

In a similar fashion, we can write an innovations state space model for each of the exponential smoothing methods of Table 7.8. Table 7.10 presents the equations for all of the models in the ETS framework.

Table 7.10: State space equations for each of the models in the ETS framework.

Estimating ETS models

An alternative to estimating the parameters by minimizing the sum of squared errors is to maximize the “likelihood”. The likelihood is the probability of the data arising from the specified model, so a large likelihood is associated with a good model. For an additive error model, maximizing the likelihood gives the same results as minimizing the sum of squared errors. However, different results will be obtained for multiplicative error models. In this section, we will estimate the smoothing parameters $\alpha$, $\beta$, $\gamma$ and $\phi$, and the initial states $\ell_0$, $b_0$, $s_0,s_{-1},\dots,s_{-m+1}$, by maximizing the likelihood.
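
This equivalence for additive error models can be checked numerically. The sketch below uses the ets() function described later in this section, with the oil data from Example 7.1 purely for illustration; the two estimates of $\alpha$ should agree up to optimizer tolerance.

R code
# A sketch comparing maximum likelihood and least squares estimation for an
# additive error model (minimizing the MSE is equivalent to minimizing the
# sum of squared errors).
library(fpp)   # assumed source of the oil data
oildata <- window(oil, start=1996, end=2007)
fit.lik <- ets(oildata, model="ANN", opt.crit="lik")
fit.mse <- ets(oildata, model="ANN", opt.crit="mse")
c(fit.lik$par["alpha"], fit.mse$par["alpha"])  # nearly identical
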
The possible values that the smoothing parameters can take are restricted. Traditionally, the parameters have been constrained to lie between 0 and 1 so that the equations can be interpreted as weighted averages. That is, $0< \alpha,\beta^*,\gamma^*,\phi<1$. For the state space models, we have set $\beta=\alpha\beta^*$ and $\gamma=(1-\alpha)\gamma^*$. Therefore the traditional restrictions translate to $0< \alpha <1$, $0 < \beta < \alpha$ and $0< \gamma < 1-\alpha$. In practice, the damping parameter $\phi$ is usually constrained further to prevent numerical difficulties in estimating the model. A common constraint is to set $0.8 <\phi <0.98$.

Another way to view the parameters is through a consideration of the mathematical properties of the state space models. The parameters are then constrained to prevent observations in the distant past having a continuing effect on current forecasts. This leads to some admissibility constraints on the parameters, which are usually (but not always) less restrictive than the usual region. For example, for the ETS(A,N,N) model the usual parameter region is $0< \alpha <1$ but the admissible region is $0< \alpha <2$. For the ETS(A,A,N) model, the usual parameter region is $0<\alpha<1$ and $0<\beta<\alpha$ but the admissible region is $0<\alpha<2$ and $0<\beta<4-2\alpha$.
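
The two regions can be selected via the bounds argument of the ets() function described below. A sketch, again using the oil data only for illustration:

R code
# A sketch contrasting the usual and admissible parameter regions.
library(fpp)
oildata <- window(oil, start=1996, end=2007)
fit.usual <- ets(oildata, model="ANN", bounds="usual")       # 0 < alpha < 1
fit.admis <- ets(oildata, model="ANN", bounds="admissible")  # 0 < alpha < 2
c(fit.usual$par["alpha"], fit.admis$par["alpha"])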

Model selection

A great advantage of the ETS statistical framework is that information criteria can be used for model selection. The AIC, AIC$_{\text{c}}$ and BIC, introduced in Section 5.3, can be used here to determine which of the 30 ETS models is most appropriate for a given time series.

For ETS models, Akaike’s Information Criterion (AIC) is defined as $$\text{AIC} = -2\log(L) + 2k, $$ where $L$ is the likelihood of the model and $k$ is the total number of parameters and initial states that have been estimated (including the residual variance). The AIC corrected for small sample bias (AIC$_{\text{c}}$) is defined as

$$\text{AIC}_{\text{c}} = \text{AIC} + \frac{2k(k+1)}{T-k-1}, $$

and the Bayesian Information Criterion (BIC) is $$\text{BIC} = \text{AIC} + k[\log(T)-2]. $$
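
These definitions can be verified by hand from a fitted model. The sketch below assumes, as in the forecast package, that the fitted object stores the log-likelihood in loglik, the estimated parameters and initial states in par, and the original series in x; one is added to $k$ for the residual variance.

R code
# A sketch recomputing the information criteria from a fitted ets object.
library(fpp)
fit <- ets(window(oil, start=1996, end=2007), model="ANN")
k <- length(fit$par) + 1   # parameters and initial states, plus the variance
n <- length(fit$x)         # number of observations T
aic  <- -2*fit$loglik + 2*k
aicc <- aic + 2*k*(k+1)/(n-k-1)
bic  <- aic + k*(log(n)-2)
c(aic, aicc, bic)          # should match fit$aic, fit$aicc and fit$bic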

Some of the combinations of (Error, Trend, Seasonal) can lead to numerical difficulties. Specifically, the models that can cause such instabilities are: ETS(M,M,A), ETS(M,Md,A), ETS(A,N,M), ETS(A,A,M), ETS(A,Ad,M), ETS(A,M,N), ETS(A,M,A), ETS(A,M,M), ETS(A,Md,N), ETS(A,Md,A), and ETS(A,Md,M). We normally do not consider these particular combinations when selecting a model.

Models with multiplicative errors are useful when the data are strictly positive, but are not numerically stable when the data contain zeros or negative values. Therefore, multiplicative error models will not be considered if the time series is not strictly positive. In that case, only the six fully additive models will be applied.

The ets() function in R

The models can be estimated in R using the ets() function in the forecast package. The R code below shows all the possible arguments this function takes, and their default values. If only the time series is specified, and all other arguments are left at their default values, then an appropriate model will be selected automatically. We explain each of the arguments below.

R code
ets(y, model="ZZZ", damped=NULL, alpha=NULL, beta=NULL,
    gamma=NULL, phi=NULL, additive.only=FALSE, lambda=NULL,
    lower=c(rep(0.0001,3), 0.8), upper=c(rep(0.9999,3),0.98),
    opt.crit=c("lik","amse","mse","sigma","mae"), nmse=3,
    bounds=c("both","usual","admissible"),
    ic=c("aicc","aic","bic"), restrict=TRUE)
y
The time series to be forecast.
model
A three-letter code indicating the model to be estimated using the ETS classification and notation. The possible inputs are “N” for none, “A” for additive, “M” for multiplicative, or “Z” for automatic selection. If any of the inputs is left as “Z”, then this component is selected according to the information criterion chosen. The default value of “ZZZ” ensures that all components are selected using the information criterion.
damped
If damped=TRUE, then a damped trend will be used (either Ad or Md). If damped=FALSE, then a non-damped trend will be used. If damped=NULL (the default), then either a damped or a non-damped trend will be selected according to the information criterion chosen.
alpha, beta, gamma, phi
The values of the smoothing parameters can be specified using these arguments. If they are set to NULL (the default setting for each of them), the parameters are estimated.
additive.only
Only models with additive components will be considered if additive.only=TRUE. Otherwise all models will be considered.
lambda
Box-Cox transformation parameter. It will be ignored if lambda=NULL (the default value). Otherwise, the time series will be transformed before the model is estimated. When lambda is not NULL, additive.only is set to TRUE.
lower, upper
Lower and upper bounds for the parameter estimates $\alpha$, $\beta^*$, $\gamma^*$ and $\phi$.
opt.crit
The optimization criterion to be used for estimation. The default setting is maximum likelihood estimation, used when opt.crit="lik".
bounds
This specifies the constraints to be used on the parameters. The traditional constraints are set using bounds="usual" and the admissible constraints are set using bounds="admissible". The default (bounds="both") requires the parameters to satisfy both sets of constraints.
ic
The information criterion to be used in selecting models, set by default to aicc.
restrict
If restrict=TRUE (the default), the models that cause numerical difficulties are not considered in model selection.
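
As a usage sketch (not from the text), the call below restricts the search to additive components, fixes $\alpha$, and selects the remaining components by minimizing the BIC; the austourists data, used again later in this section, are assumed to come from the fpp package.

R code
# A usage sketch combining several ets() arguments.
library(fpp)
fit <- ets(austourists, additive.only=TRUE, alpha=0.4, ic="bic")
fit$method   # the selected model, e.g. "ETS(A,A,A)"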

Forecasting with ETS models

Point forecasts are obtained from the models by iterating the equations for $t=T+1,T+2,\dots,T+h$ and setting all $\varepsilon_t=0$ for $t>T$. For example, for model ETS(M,A,N), $$y_{T+1} = (\ell_T + b_T )(1+ \varepsilon_{T+1}).$$ Therefore

\begin{align*} \pred{y}{T+1}{T}&=\ell_{T}+b_{T}. \end{align*}

Similarly,

\begin{align*} y_{T+2} &= (\ell_{T+1} + b_{T+1})(1 + \varepsilon_{T+2})\\ &= \left[(\ell_T + b_T)(1+ \alpha\varepsilon_{T+1}) + b_T + \beta (\ell_T + b_T)\varepsilon_{T+1}\right] ( 1 + \varepsilon_{T+2}). \end{align*}

Therefore,

\begin{align*} \pred{y}{T+2}{T}&= \ell_{T}+2b_{T}, \end{align*}

and so on. These forecasts are identical to the forecasts from Holt’s linear method and also those from model ETS(A,A,N). So the point forecasts obtained from the method and from the two models that underlie the method are identical (assuming the same parameter values are used).
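
This recursion is easy to verify numerically. The sketch below fits an ETS(M,A,N) model to the oil data (purely for illustration), extracts the final states $\ell_T$ and $b_T$ from fit$states as stored by the forecast package, and compares the hand-computed forecasts $\ell_T+hb_T$ with the output of the forecast() function.

R code
# A sketch checking that the ETS(M,A,N) point forecasts equal l_T + h*b_T.
library(fpp)
fit <- ets(window(oil, start=1996, end=2007), model="MAN", damped=FALSE)
last <- tail(fit$states, 1)          # final state estimates (l_T, b_T)
h <- 1:3
manual <- last[,"l"] + h*last[,"b"]  # iterate with all future errors zero
cbind(manual, as.numeric(forecast(fit, h=3)$mean))  # columns should agree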

A big advantage of the models is that prediction intervals can also be generated, something that cannot be done using the methods. The prediction intervals will differ between models with additive and multiplicative errors.

For some models, there are exact formulae that enable prediction intervals to be calculated. A more general approach that works for all models is to simulate future sample paths, conditional on the last estimate of the states, and to obtain prediction intervals from the percentiles of these simulated future paths. These options are available in R using the forecast() function in the forecast package. The R code below shows all the possible arguments this function takes when applied to an ETS model. We explain each of the arguments in what follows.

R code
forecast(object, h=ifelse(object$m>1, 2*object$m, 10),
    level=c(80,95), fan=FALSE, simulate=FALSE, bootstrap=FALSE,
    npaths=5000, PI=TRUE, lambda=object$lambda, ...)
object
The object returned by the ets() function.
h
The forecast horizon — the number of periods to be forecast.
level
The confidence level for the prediction intervals.
fan
If fan=TRUE, level=seq(50,99,by=1). This is suitable for fan plots.
simulate
If simulate=TRUE, prediction intervals are produced by simulation rather than using algebraic formulae. Even if simulate=FALSE, simulation will be used when there are no algebraic formulae available for the particular model.
bootstrap
If bootstrap=TRUE and simulate=TRUE, then the simulated prediction intervals use re-sampled errors rather than normally distributed errors.
npaths
The number of sample paths used in computing simulated prediction intervals.
PI
If PI=TRUE, then prediction intervals are produced; otherwise only point forecasts are calculated. If PI=FALSE, then level, fan, simulate, bootstrap and npaths are all ignored.
lambda
The Box-Cox transformation parameter. This is ignored if lambda=NULL. Otherwise, forecasts are back-transformed via an inverse Box-Cox transformation.
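
As a usage sketch, the call below produces simulated prediction intervals with re-sampled (bootstrapped) errors from 1000 future sample paths; the model and settings are illustrative only.

R code
# A sketch of simulated, bootstrapped prediction intervals.
library(fpp)
fit <- ets(window(oil, start=1996, end=2007), model="ANN")
fc <- forecast(fit, h=3, level=c(80,95),
    simulate=TRUE, bootstrap=TRUE, npaths=1000)
plot(fc)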

Example 7.1    Oil production example (revisited)

Figure 7.8 shows the point forecasts and prediction intervals from an estimated ETS(A,N,N) model. The estimates for the smoothing parameter $\alpha$ and the initial level $\ell_0$ are 0.89 and 447.49 respectively, identical to the estimates for the simple exponential smoothing method estimated earlier (see beginning of Example 7.1). The plot shows the importance of generating prediction intervals. The intervals here are relatively wide, so interpreting the point forecasts without accounting for the large uncertainty can be very misleading.

Figure 7.8: Point forecasts and 80% and 95% prediction intervals from model ETS(A,N,N).

R code
oildata <- window(oil,start=1996,end=2007)
fit <- ets(oildata, model="ANN")
plot(forecast(fit, h=3), ylab="Oil (millions of tonnes)")
fit$par
R output
  alpha       l
  0.892 447.489
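
For the ETS(A,N,N) model there is an exact formula: the variance of the $h$-step forecast error is $\sigma^2[1+\alpha^2(h-1)]$. The sketch below recomputes the 95% intervals of Figure 7.8 by hand, assuming (as in the forecast package) that fit$sigma2 holds the estimated residual variance; the results should closely match the plotted intervals.

R code
# A sketch checking the intervals against the exact ETS(A,N,N) formula
# sigma_h^2 = sigma^2*(1 + alpha^2*(h-1)).
alpha <- fit$par["alpha"]
h <- 1:3
v <- fit$sigma2*(1 + alpha^2*(h-1))
l.T <- tail(fit$states, 1)[,"l"]  # final level = point forecast for all h
cbind(lower=l.T - 1.96*sqrt(v), upper=l.T + 1.96*sqrt(v))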

Example 7.4    International tourist visitor nights in Australia (revisited)

We now employ the ETS statistical framework to forecast international tourist visitor nights in Australia over the period 2011–2012. We let the ets() function select the model by minimizing the AICc.

R code
vndata <- window(austourists, start=2005)
fit <- ets(vndata)
summary(fit)
R output
ETS(M,A,M)

Call:
 ets(y = vndata)

  Smoothing parameters:
    alpha = 0.4504
    beta  = 4e-04
    gamma = 0.0046

  Initial states:
    l = 32.4349
    b = 0.6533
    s=1.0275 0.9463 0.7613 1.2648

  sigma:  0.0332

    AIC    AICc     BIC
105.972 115.572 115.397

Training set error measures:
                    ME    RMSE    MAE       MPE    MAPE     MASE      ACF1
Training set -0.081646 1.33328 1.0647 -0.221243 2.65469 0.374361 -0.089119

The model selected is ETS(M,A,M):

\begin{align*} y_{t} &= (\ell_{t-1} + b_{t-1})s_{t-m}(1 + \varepsilon_t)\\ \ell_t &= (\ell_{t-1} + b_{t-1})(1 + \alpha \varepsilon_t)\\ b_t &=b_{t-1} + \beta(\ell_{t-1} + b_{t-1})\varepsilon_t\\ s_t &= s_{t-m}(1+ \gamma \varepsilon_t). \end{align*}

The parameter estimates are $\alpha=0.4504$, $\beta=0.0004$, and $\gamma=0.0046$. The output returns the estimates for the initial states $\ell_0$, $b_0$, $s_{0}$, $s_{-1}$, $s_{-2}$ and $s_{-3}$. The within-sample RMSE for this model is slightly lower than for the Holt-Winters methods with additive and multiplicative seasonality presented in Tables 7.5 and 7.6 respectively. Figure 7.9 shows the states over time, while Figure 7.10 shows point forecasts and 80% and 95% prediction intervals generated from the model. The intervals this time are much narrower than the prediction intervals in the oil production example.

Figure 7.9: Graphical representation of the estimated states over time.

R code
plot(fit)

Figure 7.10: Forecasting international visitor nights in Australia from an ETS(M,A,M) model.

R code
plot(forecast(fit,h=8),
  ylab="International visitor night in Australia (millions)")