9.4 Forecasting hierarchical or grouped time series
Warning: this is a more advanced section and assumes knowledge of matrix algebra.
Time series can often be naturally disaggregated in a hierarchical structure using attributes such as geographical location, product type, etc. For example, the total number of bicycles sold by a cycling warehouse can be disaggregated into a hierarchy of bicycle types. Such a warehouse will sell road bikes, mountain bikes, children bikes or hybrids. Each of these can be disaggregated into finer categories. Children’s bikes can be divided into balance bikes for children under 4 years old, single speed bikes for children between 4 and 6 and bikes for children over the age of 6. Hybrid bikes can be divided into city, commuting, comfort, and trekking bikes; and so on. Such disaggregation imposes a hierarchical structure. We refer to these as hierarchical time series.
Another possibility is that series can be naturally grouped together based on attributes without necessarily imposing a hierarchical structure. For example the bicycles sold by the warehouse can be for males, females or unisex. They can be used for racing, commuting or recreational purposes. They can be single speed or have multiple gears. Frames can be carbon, aluminium or steel. Grouped time series can be thought of as hierarchical time series that do not impose a unique hierarchical structure in the sense that the order by which the series can be grouped is not unique.
In this section we present alternative approaches for forecasting time series data that possess hierarchical structures. Figure 9.12 shows a $K=2$level hierarchy.^{1} At the top of the hierarchy, level 0, is the “Total”, the most aggregate level of the data. We denote as $y_ t$ the $t$th observation of the “Total” series for $t=1,\dots ,T$. Below this level we denote as $y_{\text{j},t}$ the $t$th observation of the series which corresponds to node $j$ of the hierarchical tree. The “Total” is disaggregated into 2 series at level 1 and each of these into 3 and 2 series respectively at the bottom level of the hierarchy, level 2. The total number of series in a hierarchy is given by $n=1+n_1+\dots +n_ K$ where $n_ i$ is the number of series at level $i$ of the hierarchy. In this case $n=1+2+5=8$.
For any time $t$, the observations of the bottom level series will aggregate to the observations of the series above. This can be effectively represented using matrix notation. We construct an $n\times n_ K$ matrix referred to as the “summing” matrix $\bm{S} $ which dictates how the bottom level series are aggregated, consistent with the hierarchical structure. For the hierarchy in Figure 9.12 we can write
or in more compact notation
where $\bm{y}_ t$ is a vector of all the observations in the hierarchy at time $t$, $\bm{S} $ is the summing matrix as defined above, and $\bm{y}_{K,t}$ is a vector of all the observation in the bottom level of the hierarchy at time $t$.
We are interested in generating forecasts for each series in the hierarchy. We denote as $\hat{y}_{\text{j},h}$ the $h$stepahead forecast generated for the series at node $j$ having observed the time series up to observation $T$ and as $\hat{y}_{h}$ the $h$stepahead forecast generated for the “Total” series.^{2} We refer to these as “base” forecasts. They are independent forecasts generated for each series in the hierarchy using a suitable forecasting method presented in earlier sections of this book. These base forecasts are then combined to produce final forecasts for the whole hierarchy that aggregate in a manner that is consistent with the structure of the hierarchy. We refer to these as revised forecasts and denote them as $\tilde{y}_{\text{j},h}$ and $\tilde{y}_{h}$ respectively.
There are a number of ways of combining the base forecasts in order to obtain revised forecasts. The following sections discuss some of the possible combining approaches.
The bottomup approach
A commonly applied method for hierarchical forecasting is the bottomup approach. This approach involves first generating base independent forecasts for each series at the bottom level of the hierarchy and then aggregating these upwards to produce revised forecasts for the whole hierarchy.
Using matrix notation we can again employ the summing matrix and write
The greatest advantage of this approach is that no information is lost due to aggregation. On the other hand bottom level data can be quite noisy and more challenging to model and forecast.
Topdown approaches
Topdown approaches involve first generating base forecasts for the “Total” series $y_ t$ on the top of the hierarchy and then disaggregating these downwards. We let $ p_1,\dots ,p_{n_ K} $ be a set of proportions which dictate how the base forecasts of the “Total” series are to be distributed to revised forecasts for each series at the bottom level of the hierarchy. Once the bottom level forecasts have been generated we can use the summing matrix to generate forecasts for the rest of the series in the hierarchy. Note that for topdown approaches the top level revised forecasts are equal to the top level base forecasts, i.e., $\tilde{y}_{h}=\hat{y}_{h}$.
The most common topdown approaches specify proportions based on the historical proportions of the data. The two most common versions follow.
Average historical proportions
$$p_j = \frac{1}{T}\sum_{t=1}^T \frac{y_{j,t}}{y_t}$$ for $j=1,\dots ,n_ K$. Each proportion $p_ j$ reflects the average of the historical proportions of the bottom level series $y_{j,t}$ over the period $t=1,\dots ,T$ relative to the total aggregate $y_ t$.
Proportions of the historical averages
[ p_ j={\sum_{t=1}^{T}\frac{y_{j,t}}{T}}/{\sum_{t=1}^{T}\frac{y_ t}{T}} ] for $j=1,\dots ,n_ K$. Each proportion $p_ j$ captures the average historical value of the bottom level series $y_{j,t}$ relative to the average value of the total aggregate $y_ t$.
The greatest attribute of such topdown approaches is their simplicity to apply. One only needs to model and generate forecasts for the most aggregated top level series. In general these approaches seem to produce quite reliable forecasts for the aggregate levels and they are very useful with low count data. On the other hand, their greatest disadvantage is the loss of information due to aggregation. With these topdown approaches, we are unable to capture and take advantage of individual series characteristics such as time dynamics, special events, etc.
In the example on forecasting Australian domestic tourism demand that follows, the data are highly seasonal. The seasonal pattern of tourist arrivals may vary across series depending on the tourism destination. An area with beaches as its main tourist attractions will have a very different seasonal pattern to an area with skiing as its main tourist attraction. This will not be captured by disaggregating the total of these destinations based on historical proportions. Finally, with these methods the disaggregation of the “Total” series forecasts depends on historical and static proportions, and these proportions may be distorted by trends in the data.
Forecasted proportions
An alternative approach that improves on the historical and static nature of the proportions specified above is to use forecasted proportions.
To demonstrate the intuition of this method, consider a one level hierarchy. We first generate $h$stepahead base forecasts for all the series independently. At level 1 we calculate the proportion of each $h$stepahead base forecast to the aggregate of all the $h$stepahead base forecasts at this level. We refer to these as the forecasted proportions and we use these to disaggregate the top level forecast and generate revised forecasts for the whole of the hierarchy.
For a $K$level hierarchy this process is repeated for each node going from the top to the very bottom level. Applying this process leads to the following general rule for obtaining the forecasted proportions
for $j=1,2,\dots ,n_ K$. These forecasted proportions disaggregate the $h$stepahead base forecast of the “Total” series to $h$stepahead revised forecasts of the bottom level series. $\hat{y}_{j,h}^{(\ell )}$ is the $h$stepahead base forecast of the series that corresponds to the node which is $\ell $ levels above $j$. $\hat{S}_{j,h}^{(\ell )}$ is the sum of the $h$stepahead base forecasts below the node that is $\ell $ levels above node $j$ and are directly connected to that node.
We will use the hierarchy of Figure 9.12 to explain this notation and to demonstrate how this general rule is reached. Assume we have generated independent base forecasts for each series in the hierarchy. Remember that for the top level “Total” series, $\tilde{y}_{h}=\hat{y}_{h}$. Here are some example using the above notation:
 $\hat{y}_{\text{A},h}^{(1)}=\hat{y}_{\text{B},h}^{(1)}= \tilde{y}_{h}$
 $\hat{y}_{\text{BA},h}^{(1)}=\hat{y}_{\text{BB},h}^{(1)}= \hat{y}_{\text{B},h}$
 $\hat{y}_{\text{AA},h}^{(2)}=\hat{y}_{\text{AB},h}^{(2)}= \hat{y}_{\text{AC},h}^{(2)}=\hat{y}_{\text{BA},h}^{(2)}= \hat{y}_{\text{BB},h}^{(2)}= \tilde{y}_{h}$
 $\hat{S}_{\text{BA},h}^{(1)} = \hat{S}_{\text{BB},h}^{(1)}= \hat{y}_{\text{BA},h}+\hat{y}_{\text{BB},h}$
 $\hat{S}_{\text{BA},h}^{(2)} = \hat{S}_{\text{BB},h}^{(2)}= \hat{S}_{\text{A},h}^{(1)} = \hat{S}_{\text{B},h}^{(1)}= \hat{S}_{h}= \hat{y}_{\text{A},h}+\hat{y}_{\text{B},h}$
Moving down the farthest left branch of the hierarchy the final revised forecasts are
and
Consequently,
The other proportions can be similarly obtained. The greatest disadvantage of the topdown forecasted proportions approach, which is a disadvantage of any topdown approach, is that they do not produce unbiased revised forecasts even if the base forecasts are unbiased.
Middleout approach
The middleout approach combines bottomup and topdown approaches. First the “middle level” is chosen and base forecasts are generated for all the series of this level and the ones below. For the series above the middle level, revised forecasts are generated using the bottomup approach by aggregating the “middlelevel” base forecasts upwards. For the series below the “middle level”, revised forecasts are generated using a topdown approach by disaggregating the “middle level” base forecasts downwards.
The optimal combination approach
This approach involves first generating independent base forecast for each series in the hierarchy. As these base forecasts are independently generated they will not be “aggregate consistent” (i.e., they will not add up according to the hierarchical structure). The optimal combination approach optimally combines the independent base forecasts and generates a set of revised forecasts that are as close as possible to the univariate forecasts but also aggregate consistently with the hierarchical structure.
Unlike any other existing method, this approach uses all the information available within a hierarchy. It allows for correlations and interactions between series at each level of the hierarchy, it accounts for ad hoc adjustments of forecasts at any level, and, provided the base forecasts are unbiased, it produces unbiased revised forecasts.
The general idea is derived from the representation of the $h$stepahead base forecasts for the whole of the hierarchy by the linear regression model. We write
Assuming that the errors approximately satisfy the hierarchical aggregation structure will be true provided that the base forecasts also approximately satisfy this aggregation structure, which should occur for any reasonable set of forecasts.
The hts package
Hierarchical time series forecasting is implemented in the hts package in R.
The hts
function creates a hierarchical time series. The
required inputs are the bottom level time series obsverations, and
information about the hierarchical structure. For example, the structure
shown in Figure 9.12 is specified as follows:
# The first three series belong to one group, and the last two series
# belong to a different group
y < hts(bts, nodes=list(2, c(3,2)))
For a grouped but nonhierarchical time series, the gts
function can
be used. If there are more levels, the g
argument should be a matrix
where each row contains the grouping structure for each level.
Forecasts are obtained, as usual, with the forecast
function. By default it
produces forecasts using the optimal combination approach with ETS
models used for the base forecasts. But other models and methods can be
specified via the following arguments.
fmethod
: The forecasting model to be used for the base forecasts. Possible values are"ets"
,"arima"
and"rw"
.method
: The method used for reconciling the base forecasts. It can take the following values:comb
: Optimal combination forecasts;bu
: Bottomup forecasts;mo
: Middleout forecasts where the level used is specified by thelevel
argument’tdgsa
: Topdown forecasts based on the average historical proportions (GrossSohl method A);tdgsf
: Topdown forecasts based on the proportion of historical averages (GrossSohl method F);tdfp
: Topdown forecasts using forecast proportions.
Example 9.7 Australian tourism hierarchy
Australia is divided into eight geographical areas (some referred to as states and others as territories) with each one having its own government and some economic and administrative autonomy. Business planners and tourism authorities are interested in forecasts for the whole of Australia, the states and the territories, and also smaller regions. In this example we concentrate on quarterly domestic tourism demand, measured as the number of visitor nights Australians spend away from home, for the three largest states of Australia, namely Victoria (VIC), New South Wales (NSW), Queensland (QLD) and other. For each of these we consider visitor nights for each respective capital city, namely, Melbourne (MEL), Sydney (SYD) and Brisbane (BGC).^{3}
The hierarchical structure is shown in Figure 9.13. The CAP category in the bottom level includes visitor nights in all other five capital cities of the remaining states and territories.
Figure 9.14 shows all the times series in the hierarchy which span the period 1998:Q1 to 2011:Q4. The dotted lines show the revised forecasts generated by the optimal combination approach. The base forecasts for each series are generated using the ETS methodology of Chapter 7.
y < hts(vn, nodes=list(4,c(2,2,2,2)))
allf < forecast(y, h=8)
plot(allf)

We will use this hierarchical structure in order to introduce the methods for forecasting hierarchical time series that follow. ↩

We have simplified the previously used notation of $\hat{y}_{T+hT}$ for brevity. ↩

For the purpose of this example, we include with Brisbane, visitor nights at the Gold Coast, a coastal city and a major tourism attraction near Brisbane. ↩