9.4 Forecasting hierarchical or grouped time series

Warning: this is a more advanced section and assumes knowledge of matrix algebra.

Time series can often be naturally disaggregated in a hierarchical structure using attributes such as geographical location, product type, etc. For example, the total number of bicycles sold by a cycling warehouse can be disaggregated into a hierarchy of bicycle types. Such a warehouse will sell road bikes, mountain bikes, children bikes or hybrids. Each of these can be disaggregated into finer categories. Children’s bikes can be divided into balance bikes for children under 4 years old, single speed bikes for children between 4 and 6 and bikes for children over the age of 6. Hybrid bikes can be divided into city, commuting, comfort, and trekking bikes; and so on. Such disaggregation imposes a hierarchical structure. We refer to these as hierarchical time series.

Another possibility is that series can be naturally grouped together based on attributes without necessarily imposing a hierarchical structure. For example the bicycles sold by the warehouse can be for males, females or unisex. They can be used for racing, commuting or recreational purposes. They can be single speed or have multiple gears. Frames can be carbon, aluminium or steel. Grouped time series can be thought of as hierarchical time series that do not impose a unique hierarchical structure in the sense that the order by which the series can be grouped is not unique.

In this section we present alternative approaches for forecasting time series data that possess hierarchical structures. Figure 9.12 shows a $K=2$-level hierarchy.1 At the top of the hierarchy, level 0, is the “Total”, the most aggregate level of the data. We denote as $y_ t$ the $t$th observation of the “Total” series for $t=1,\dots ,T$. Below this level we denote as $y_{\text{j},t}$ the $t$th observation of the series which corresponds to node $j$ of the hierarchical tree. The “Total” is disaggregated into 2 series at level 1 and each of these into 3 and 2 series respectively at the bottom level of the hierarchy, level 2. The total number of series in a hierarchy is given by $n=1+n_1+\dots +n_ K$ where $n_ i$ is the number of series at level $i$ of the hierarchy. In this case $n=1+2+5=8$.

Figure 9.12: A two level hierarchical tree diagram.

For any time $t$, the observations of the bottom level series will aggregate to the observations of the series above. This can be effectively represented using matrix notation. We construct an $n\times n_ K$ matrix referred to as the “summing” matrix $\bm{S} $ which dictates how the bottom level series are aggregated, consistent with the hierarchical structure. For the hierarchy in Figure 9.12 we can write

$$ \newcommand{\y}[2]{y_{#1,#2}} \left[ \begin{array}{c} y_{t} \\ \y{A}{t} \\ \y{B}{t} \\ \y{AA}{t} \\ \y{AB}{t} \\ \y{AC}{t} \\ \y{BA}{t} \\ \y{BB}{t} \end{array} \right] = \left[\begin{array}{ccccc} 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{c} \y{AA}{t} \\ \y{AB}{t} \\ \y{AC}{t} \\ \y{BA}{t} \\ \y{BB}{t} \end{array} \right] $$

or in more compact notation

$$ \bm{y}_ t=\bm{S} \bm{y}_{K,t}, $$

where $\bm{y}_ t$ is a vector of all the observations in the hierarchy at time $t$, $\bm{S} $ is the summing matrix as defined above, and $\bm{y}_{K,t}$ is a vector of all the observation in the bottom level of the hierarchy at time $t$.

We are interested in generating forecasts for each series in the hierarchy. We denote as $\hat{y}_{\text{j},h}$ the $h$-step-ahead forecast generated for the series at node $j$ having observed the time series up to observation $T$ and as $\hat{y}_{h}$ the $h$-step-ahead forecast generated for the “Total” series.2 We refer to these as “base” forecasts. They are independent forecasts generated for each series in the hierarchy using a suitable forecasting method presented in earlier sections of this book. These base forecasts are then combined to produce final forecasts for the whole hierarchy that aggregate in a manner that is consistent with the structure of the hierarchy. We refer to these as revised forecasts and denote them as $\tilde{y}_{\text{j},h}$ and $\tilde{y}_{h}$ respectively.

There are a number of ways of combining the base forecasts in order to obtain revised forecasts. The following sections discuss some of the possible combining approaches.

The bottom-up approach

A commonly applied method for hierarchical forecasting is the bottom-up approach. This approach involves first generating base independent forecasts for each series at the bottom level of the hierarchy and then aggregating these upwards to produce revised forecasts for the whole hierarchy.

For example, for the hierarchy of Figure 9.12 we first generate $h$-step-ahead base forecasts for the bottom level series: $\hat{y}_{\text {AA},h}$, $\hat{y}_{\text {AB},h}$, $\hat{y}_{\text {AC},h}$, $\hat{y}_{\text {BA},h}$ and $\hat{y}_{\text {BB},h}$. Aggregating these up the hierarchy we get $h$-step-ahead forecasts for the rest of the series: $\tilde{y}_{\text {A},h}= \hat{y}_{\text {AA},h}+\hat{y}_{\text {AB},h}+\hat{y}_{\text {AC},h}$, $\tilde{y}_{\text {B},h}= \hat{y}_{\text {BA},h}+\hat{y}_{\text {BB},h}$ and $\tilde{y}_{h}=\tilde{y}_{\text {A},h}+\tilde{y}_{\text {B},h}$. Note that for the bottom-up approach the revised forecasts for the bottom level series are equal to the base forecasts.

Using matrix notation we can again employ the summing matrix and write

$$ \tilde{\bm{y} }_{h}=\bm{S} \hat{\bm{y} }_{K,h}. $$

The greatest advantage of this approach is that no information is lost due to aggregation. On the other hand bottom level data can be quite noisy and more challenging to model and forecast.

Top-down approaches

Top-down approaches involve first generating base forecasts for the “Total” series $y_ t$ on the top of the hierarchy and then disaggregating these downwards. We let $ p_1,\dots ,p_{n_ K} $ be a set of proportions which dictate how the base forecasts of the “Total” series are to be distributed to revised forecasts for each series at the bottom level of the hierarchy. Once the bottom level forecasts have been generated we can use the summing matrix to generate forecasts for the rest of the series in the hierarchy. Note that for top-down approaches the top level revised forecasts are equal to the top level base forecasts, i.e., $\tilde{y}_{h}=\hat{y}_{h}$.

The most common top-down approaches specify proportions based on the historical proportions of the data. The two most common versions follow.

Average historical proportions

$$p_j = \frac{1}{T}\sum_{t=1}^T \frac{y_{j,t}}{y_t}$$ for $j=1,\dots ,n_ K$. Each proportion $p_ j$ reflects the average of the historical proportions of the bottom level series $y_{j,t}$ over the period $t=1,\dots ,T$ relative to the total aggregate $y_ t$.

Proportions of the historical averages

[ p_ j={\sum_{t=1}^{T}\frac{y_{j,t}}{T}}/{\sum_{t=1}^{T}\frac{y_ t}{T}} ] for $j=1,\dots ,n_ K$. Each proportion $p_ j$ captures the average historical value of the bottom level series $y_{j,t}$ relative to the average value of the total aggregate $y_ t$.

The greatest attribute of such top-down approaches is their simplicity to apply. One only needs to model and generate forecasts for the most aggregated top level series. In general these approaches seem to produce quite reliable forecasts for the aggregate levels and they are very useful with low count data. On the other hand, their greatest disadvantage is the loss of information due to aggregation. With these top-down approaches, we are unable to capture and take advantage of individual series characteristics such as time dynamics, special events, etc.

In the example on forecasting Australian domestic tourism demand that follows, the data are highly seasonal. The seasonal pattern of tourist arrivals may vary across series depending on the tourism destination. An area with beaches as its main tourist attractions will have a very different seasonal pattern to an area with skiing as its main tourist attraction. This will not be captured by disaggregating the total of these destinations based on historical proportions. Finally, with these methods the disaggregation of the “Total” series forecasts depends on historical and static proportions, and these proportions may be distorted by trends in the data.

Forecasted proportions

An alternative approach that improves on the historical and static nature of the proportions specified above is to use forecasted proportions.

To demonstrate the intuition of this method, consider a one level hierarchy. We first generate $h$-step-ahead base forecasts for all the series independently. At level 1 we calculate the proportion of each $h$-step-ahead base forecast to the aggregate of all the $h$-step-ahead base forecasts at this level. We refer to these as the forecasted proportions and we use these to disaggregate the top level forecast and generate revised forecasts for the whole of the hierarchy.

For a $K$-level hierarchy this process is repeated for each node going from the top to the very bottom level. Applying this process leads to the following general rule for obtaining the forecasted proportions

$$ p_ j=\prod^{K-1}_{\ell =0}\frac{\hat{y}_{j,h}^{(\ell )}}{\hat{S}_{j,h}^{(\ell +1)}} $$

for $j=1,2,\dots ,n_ K$. These forecasted proportions disaggregate the $h$-step-ahead base forecast of the “Total” series to $h$-step-ahead revised forecasts of the bottom level series. $\hat{y}_{j,h}^{(\ell )}$ is the $h$-step-ahead base forecast of the series that corresponds to the node which is $\ell $ levels above $j$. $\hat{S}_{j,h}^{(\ell )}$ is the sum of the $h$-step-ahead base forecasts below the node that is $\ell $ levels above node $j$ and are directly connected to that node.

We will use the hierarchy of Figure 9.12 to explain this notation and to demonstrate how this general rule is reached. Assume we have generated independent base forecasts for each series in the hierarchy. Remember that for the top level “Total” series, $\tilde{y}_{h}=\hat{y}_{h}$. Here are some example using the above notation:

  • $\hat{y}_{\text{A},h}^{(1)}=\hat{y}_{\text{B},h}^{(1)}= \tilde{y}_{h}$
  • $\hat{y}_{\text{BA},h}^{(1)}=\hat{y}_{\text{BB},h}^{(1)}= \hat{y}_{\text{B},h}$
  • $\hat{y}_{\text{AA},h}^{(2)}=\hat{y}_{\text{AB},h}^{(2)}= \hat{y}_{\text{AC},h}^{(2)}=\hat{y}_{\text{BA},h}^{(2)}= \hat{y}_{\text{BB},h}^{(2)}= \tilde{y}_{h}$
  • $\hat{S}_{\text{BA},h}^{(1)} = \hat{S}_{\text{BB},h}^{(1)}= \hat{y}_{\text{BA},h}+\hat{y}_{\text{BB},h}$
  • $\hat{S}_{\text{BA},h}^{(2)} = \hat{S}_{\text{BB},h}^{(2)}= \hat{S}_{\text{A},h}^{(1)} = \hat{S}_{\text{B},h}^{(1)}= \hat{S}_{h}= \hat{y}_{\text{A},h}+\hat{y}_{\text{B},h}$

Moving down the farthest left branch of the hierarchy the final revised forecasts are

$$ \tilde{y}_{\text{A},h} = (\frac{\hat{y}_{\text{A},h}}{\hat{S}_{\text{A},h}^{(1)}}) \tilde{y}_{h} = (\frac{\hat{y}_{\text{AA},h}^{(1)}}{\hat{S}_{\text{AA},h}^{(2)}}) \tilde{y}_{h} $$

and

$$\tilde{y}_{\text{AA},h} = (\frac{\hat{y}_{\text{AA},h}}{\hat{S}_{\text{AA},h}^{(1)}}) \tilde{y}_{\text{A},h} =(\frac{\hat{y}_{\text{AA},h}}{\hat{S}_{\text{AA},h}^{(1)}}) (\frac{\hat{y}_{\text{AA},h}^{(1)}}{\hat{S}_{\text{AA},h}^{(2)}})\tilde{y}_{h}. $$

Consequently,

$$ p_1=(\frac{\hat{y}_{\text {AA},h}}{\hat{S}_{\text {AA},h}^{(1)}}) (\frac{\hat{y}_{\text {AA},h}^{(1)}}{\hat{S}_{\text {AA},h}^{(2)}}) $$

The other proportions can be similarly obtained. The greatest disadvantage of the top-down forecasted proportions approach, which is a disadvantage of any top-down approach, is that they do not produce unbiased revised forecasts even if the base forecasts are unbiased.

Middle-out approach

The middle-out approach combines bottom-up and top-down approaches. First the “middle level” is chosen and base forecasts are generated for all the series of this level and the ones below. For the series above the middle level, revised forecasts are generated using the bottom-up approach by aggregating the “middle-level” base forecasts upwards. For the series below the “middle level”, revised forecasts are generated using a top-down approach by disaggregating the “middle level” base forecasts downwards.

The optimal combination approach

This approach involves first generating independent base forecast for each series in the hierarchy. As these base forecasts are independently generated they will not be “aggregate consistent” (i.e., they will not add up according to the hierarchical structure). The optimal combination approach optimally combines the independent base forecasts and generates a set of revised forecasts that are as close as possible to the univariate forecasts but also aggregate consistently with the hierarchical structure.

Unlike any other existing method, this approach uses all the information available within a hierarchy. It allows for correlations and interactions between series at each level of the hierarchy, it accounts for ad hoc adjustments of forecasts at any level, and, provided the base forecasts are unbiased, it produces unbiased revised forecasts.

The general idea is derived from the representation of the $h$-step-ahead base forecasts for the whole of the hierarchy by the linear regression model. We write

$$ \hat{\bm{y} }_ h= \bm{S} \bm{\beta }_{h}+\bm{\varepsilon }_{h}$$
where $\hat{\bm{y} }_h$ is a vector of the $h$-step-ahead base forecasts for the whole hierarchy, $\bm{\beta }_{h}$ is the unknown mean of the future values of the bottom level $K$, and $\bm{\varepsilon}_{h}$ has zero mean and covariance matrix $\Sigma_{h}$. Note that $\bm{\varepsilon}_{h}$ represents the error in the above regression and should not be confused with the $h$-step-ahead forecast error.
In general $\Sigma_{h}$ in unknown. However, if we assume that the errors approximately satisfy the same aggregation structure as the original data, (i.e., $\bm{\varepsilon }_{h}\approx \bm{S} \bm{\varepsilon }_{K,h}$ where $\bm{\varepsilon }_{K,h}$ contains the forecast errors in the bottom level), then the best linear unbiased estimator for $\bm{\beta}_{h}$ is $\hat{\bm{\beta }}_{h}=(\bm{S} ’\bm{S} )^{-1}\bm{S} ’\hat{\bm{y} }_ n(h)$. This leads to a set of revised forecasts given by
$$ \tilde{\bm{y} }_{h}=\bm{S} (\bm{S} ’\bm{S} )^{-1}\bm{S} ’\hat{\bm{y} }_h. $$

Assuming that the errors approximately satisfy the hierarchical aggregation structure will be true provided that the base forecasts also approximately satisfy this aggregation structure, which should occur for any reasonable set of forecasts.

The hts package

Hierarchical time series forecasting is implemented in the hts package in R.

The hts function creates a hierarchical time series. The required inputs are the bottom level time series obsverations, and information about the hierarchical structure. For example, the structure shown in Figure 9.12 is specified as follows:

R code
# bts is a time series matrix containing the bottom level series
# The first three series belong to one group, and the last two series
# belong to a different group
y <- hts(bts, nodes=list(2, c(3,2)))

For a grouped but non-hierarchical time series, the gts function can be used. If there are more levels, the g argument should be a matrix where each row contains the grouping structure for each level.

Forecasts are obtained, as usual, with the forecast function. By default it produces forecasts using the optimal combination approach with ETS models used for the base forecasts. But other models and methods can be specified via the following arguments.

fmethod : The forecasting model to be used for the base forecasts. Possible values are "ets", "arima" and "rw".
method : The method used for reconciling the base forecasts. It can take the following values:
comb : Optimal combination forecasts;
bu : Bottom-up forecasts;
mo : Middle-out forecasts where the level used is specified by the level argument’
tdgsa : Top-down forecasts based on the average historical proportions (Gross-Sohl method A);
tdgsf : Top-down forecasts based on the proportion of historical averages (Gross-Sohl method F);
tdfp : Top-down forecasts using forecast proportions.

Example 9.7 Australian tourism hierarchy

Australia is divided into eight geographical areas (some referred to as states and others as territories) with each one having its own government and some economic and administrative autonomy. Business planners and tourism authorities are interested in forecasts for the whole of Australia, the states and the territories, and also smaller regions. In this example we concentrate on quarterly domestic tourism demand, measured as the number of visitor nights Australians spend away from home, for the three largest states of Australia, namely Victoria (VIC), New South Wales (NSW), Queensland (QLD) and other. For each of these we consider visitor nights for each respective capital city, namely, Melbourne (MEL), Sydney (SYD) and Brisbane (BGC).3

The hierarchical structure is shown in Figure 9.13. The CAP category in the bottom level includes visitor nights in all other five capital cities of the remaining states and territories.

Figure 9.13: Australian tourism hierarchy.

Figure 9.14 shows all the times series in the hierarchy which span the period 1998:Q1 to 2011:Q4. The dotted lines show the revised forecasts generated by the optimal combination approach. The base forecasts for each series are generated using the ETS methodology of Chapter 7.

Figure 9.14: Hierarchical time series and 8-step-ahead revised forecasts for Australian domestic tourism generated by the optimal combination approach. The base forecasts for each series were generated using the ETS methodology of Chapter 7

R code
library(hts)
y <- hts(vn, nodes=list(4,c(2,2,2,2)))
allf <- forecast(y, h=8)
plot(allf)

  1. We will use this hierarchical structure in order to introduce the methods for forecasting hierarchical time series that follow. 

  2. We have simplified the previously used notation of $\hat{y}_{T+h|T}$ for brevity. 

  3. For the purpose of this example, we include with Brisbane, visitor nights at the Gold Coast, a coastal city and a major tourism attraction near Brisbane.