1.4 Forecasting data and methods

The appropriate forecasting methods depend largely on what data are available.

If there are no data available, or if the data available are not relevant to the forecasts, then qualitative forecasting methods must be used. These methods are not purely guesswork---there are well-developed structured approaches to obtaining good forecasts without using historical data. These methods are discussed in Chapter 3.

Quantitative forecasting can be applied when two conditions are satisfied:

  1. numerical information about the past is available;
  2. it is reasonable to assume that some aspects of the past patterns will continue into the future.

There is a wide range of quantitative forecasting methods, often developed within specific disciplines for specific purposes. Each method has its own properties, accuracies, and costs that must be considered when choosing a specific method. Most quantitative forecasting problems use either time series data (collected at regular intervals over time) or cross-sectional data (collected at a single point in time).

Cross-sectional forecasting

With cross-sectional data, we are wanting to predict the value of something we have not observed, using the information on the cases that we have observed. Examples of cross-sectional data include:

  • House prices for all houses sold in 2011 in a particular area. We are interested in predicting the price of a house not in our data set using various house characteristics: position, no. bedrooms, age, etc.
  • Fuel economy data for a range of 2009 model cars. We are interested in predicting the carbon footprint of a vehicle not in our data set using information such as the size of the engine and the fuel efficiency of the car.

Example 1.1 Car emissions

Table 1.1 gives some data on 2009 model cars, each of which has an automatic transmission, four cylinders and an engine size under 2 liters.

Model Engine (litres) City (mpg) Highway (mpg) Carbon (tons CO2 per year)
Chevrolet Aveo 1.6 25 34 6.6
Chevrolet Aveo 5 1.6 25 34 6.6
Honda Civic 1.8 25 36 6.3
Honda Civic Hybrid 1.3 40 45 4.4
Honda Fit 1.5 27 33 6.1
Honda Fit 1.5 28 35 5.9
Hyundai Accent 1.6 26 35 6.3
Kia Rio 1.6 26 35 6.1
Nissan Versa 1.8 27 33 6.3
Nissan Versa 1.8 24 32 6.8
Pontiac G3 Wave 1.6 25 34 6.6
Pontiac G3 Wave 5 1.6 25 34 6.6
Pontiac Vibe 1.8 26 31 6.6
Saturn Astra 2DR Hatchback 1.8 24 30 6.8
Saturn Astra 4DR Hatchback 1.8 24 30 6.8
Scion xD 1.8 26 32 6.6
Toyota Corolla 1.8 27 35 6.1
Toyota Matrix 1.8 25 31 6.6
Toyota Prius 1.5 48 45 4.0
Toyota Yaris 1.5 29 35 5.9

Table 1.1: Fuel economy and carbon footprints for 2009 model cars with automatic transmissions, four cylinders and small engines. City and Highway represent fuel economy while driving in the city and on the highway.

A forecaster may wish to predict the carbon footprint (tons of CO2 per year) for other similar vehicles that are not included in the above table. It is necessary to first estimate the effects of the predictors (number of cylinders, size of engine, and fuel economy) on the variable to be forecast (carbon footprint). Then, provided that we know the predictors for a car not in the table, we can forecast its carbon footprint.

Cross-sectional models are used when the variable to be forecast exhibits a relationship with one or more other predictor variables. The purpose of the cross-sectional model is to describe the form of the relationship and use it to forecast values of the forecast variable that have not been observed. Under this model, any change in predictors will affect the output of the system in a predictable way, assuming that the relationship does not change. Models in this class include regression models, additive models, and some kinds of neural networks. These models are discussed in Chapters 4, 5 and 9.

Some people use the term "predict" for cross-sectional data and "forecast" for time series data (see below). In this book, we will not make this distinction---we will use the words interchangeably.

Time series forecasting

Time series data are useful when you are forecasting something that is changing over time (e.g., stock prices, sales figures, profits, etc.). Examples of time series data include:

  • Daily IBM stock prices
  • Monthly rainfall
  • Quarterly sales results for Amazon
  • Annual Google profits

Anything that is observed sequentially over time is a time series. In this book, we will only consider time series that are observed at regular intervals of time (e.g., hourly, daily, weekly, monthly, quarterly, annually). Irregularly spaced time series can also occur, but are beyond the scope of this book.

When forecasting time series data, the aim is to estimate how the sequence of observations will continue into the future. The following figure shows the quarterly Australian beer production from 1992 to the third quarter of 2008.

Figure 1.1: Australian quarterly beer production: 1992Q1--2008Q3, with two years of forecasts.

The blue lines show forecasts for the next two years. Notice how the forecasts have captured the seasonal pattern seen in the historical data and replicated it for the next two years. The dark shaded region shows 80% prediction intervals. That is, each future value is expected to lie in the dark blue region with a probability of 80%. The light shaded region shows 95% prediction intervals. These prediction intervals are a very useful way of displaying the uncertainty in forecasts. In this case, the forecasts are expected to be very accurate, hence the prediction intervals are quite narrow.

Time series forecasting uses only information on the variable to be forecast, and makes no attempt to discover the factors which affect its behavior. Therefore it will extrapolate trend and seasonal patterns, but it ignores all other information such as marketing initiatives, competitor activity, changes in economic conditions, and so on.

Time series models used for forecasting include ARIMA models, exponential smoothing and structural models. These models are discussed in Chapters 6, 7 and 8.

Predictor variables and time series forecasting

Predictor variables can also be used in time series forecasting. For example, suppose we wish to forecast the hourly electricity demand (ED) of a hot region during the summer period. A model with predictor variables might be of the form

\begin{align*} \text{ED} = & f(\text{current temperature, strength of economy, population,}\\ & \qquad \text{time of day, day of week, error}). \end{align*}

The relationship is not exact---there will always be changes in electricity demand that cannot be accounted for by the predictor variables. The “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model. We call this an “explanatory model” because it helps explain what causes the variation in electricity demand.

Because the electricity demand data form a time series, we could also use a time series model for forecasting. In this case, a suitable time series forecasting equation is of the form

$$ \text{ED}_{t+1} = f(\text{ED}_{t}, \text{ED}_{t-1}, \text{ED}_{t-2}, \text{ED}_{t-3},\dots, \text{error}), $$

where $t$ is the present hour, $t+1$ is the next hour, $t-1$ is the previous hour, $t-2$ is two hours ago, and so on. Here, prediction of the future is based on past values of a variable, but not on external variables which may affect the system. Again, the "error" term on the right allows for random variation and the effects of relevant variables that are not included in the model.

There is also a third type of model which combines the features of the above two models. For example, it might be given by

$$ \text{ED}_{t+1} = f(\text{ED}_{t}, \text{current temperature, time of day, day of week, error}). $$

These types of mixed models have been given various names in different disciplines. They are known as dynamic regression models, panel data models, longitudinal models, transfer function models, and linear system models (assuming $f$ is linear). These models are discussed in Chapter 9.

An explanatory model is very useful because it incorporates information about other variables, rather than only historical values of the variable to be forecast. However, there are several reasons a forecaster might select a time series model rather than an explanatory model. First, the system may not be understood, and even if it was understood it may be extremely difficult to measure the relationships that are assumed to govern its behavior. Second, it is necessary to know or forecast the various predictors in order to be able to forecast the variable of interest, and this may be too difficult. Third, the main concern may be only to predict what will happen, not to know why it happens. Finally, the time series model may give more accurate forecasts than an explanatory or mixed model.

The model to be used in forecasting depends on the resources and data available, the accuracy of the competing models, and how the forecasting model is to be used.

Notation

For cross-sectional data, we will use the subscript $i$ to indicate a specific observation. For example, $y_i$ will denote the $i$th observation in a data set. We will also use $N$ to denote the total number of observations in the data set. For time series data, we will use the subscript $t$ instead of $i$. For example, $y_t$ will denote the observation at time $t$. We will use $T$ to denote the number of observations in a time series. When we are making general comments that could be applicable to either cross-sectional or time series data, we will tend to use $i$ and $N$.