4.8 Regression with time series data

When using regression for prediction, we are often working with time series data and aiming to forecast the future. In this section we consider a few issues that arise with time series data but not with cross-sectional data.

Example 4.3 US consumption expenditure

Figure 4.10: Percentage changes in personal consumption expenditure for the US.

R code
fit.ex3 <- tslm(consumption ~ income, data=usconsumption)
plot(usconsumption, ylab="% change in consumption and income",
  plot.type="single", col=1:2, xlab="Year")
legend("topright", legend=c("Consumption","Income"),
 lty=1, col=c(1,2), cex=.9)
plot(consumption ~ income, data=usconsumption,
 ylab="% change in consumption", xlab="% change in income")
R output
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.52062    0.06231   8.356 2.79e-14 ***
income       0.31866    0.05226   6.098 7.61e-09 ***

Figure 4.10 shows time series plots of quarterly percentage changes (growth rates) of real personal consumption expenditure ($C$) and real personal disposable income ($I$) for the US for the period March 1970 to December 2010. Also shown is a scatter plot, including the estimated regression line $$\hat{C}=0.52+0.32I, $$ with the estimation results shown below the graphs. These show that when income grows by $1\%$, the predicted growth in personal consumption expenditure is $0.52+0.32=0.84\%$. We are interested in forecasting consumption for the four quarters of 2011.

Using a regression model to forecast time series data poses a challenge in that future values of the predictor variable (income in this case) need to be input into the estimated model, but these are not known in advance. One solution to this problem is to use “scenario based forecasting”.

Scenario based forecasting

In this setting the forecaster assumes possible scenarios for the predictor variable that are of interest. For example, a US policy maker may want to forecast consumption if there is a 1% growth in income for each of the quarters in 2011. Alternatively, a 1% decline in income for each quarter may be of interest. The resulting forecasts are calculated and shown in Figure 4.11.

Figure 4.11: Forecasting percentage changes in personal consumption expenditure for the US.

R code
fcast <- forecast(fit.ex3, newdata=data.frame(income=c(-1,1)))
plot(fcast, ylab="% change in consumption", xlab="% change in income")

Forecast intervals for scenario based forecasts do not include the uncertainty associated with the future values of the predictor variables. They assume the value of the predictor is known in advance.

An alternative approach is to use genuine forecasts for the predictor variable. For example, a pure time series based approach can be used to generate forecasts for the predictor variable (more on this in Chapter 9) or forecasts published by some other source such as a government agency can be used.
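As a rough sketch of this approach (using base R only and simulated stand-in data, not the usconsumption series): first fit a pure time series model to the predictor, then feed its forecasts into the regression as if they were known values.

```r
# Minimal sketch with simulated data standing in for the real series.
set.seed(7)
income <- arima.sim(list(ar=0.5), n=160)             # stand-in for income growth
consumption <- 0.5 + 0.3*income + rnorm(160, sd=0.2) # stand-in for consumption growth
fit <- lm(consumption ~ income)                      # the regression model

# Forecast the predictor with a time series model (an AR(1) here),
# then plug those forecasts into the regression.
inc.fc <- predict(arima(income, order=c(1,0,0)), n.ahead=4)$pred
cons.fc <- predict(fit, newdata=data.frame(income=as.numeric(inc.fc)))
```

In practice the forecast intervals should also reflect the uncertainty in the predictor forecasts, which this naive plug-in sketch ignores.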

Ex-ante versus ex-post forecasts

When using regression models with time series data, we need to distinguish between two different types of forecasts that can be produced, depending on what is assumed to be known when the forecasts are computed.

Ex-ante forecasts are those that are made using only the information that is available in advance. For example, ex-ante forecasts of consumption for the four quarters in 2011 should use only information that was available before 2011. These are the only genuine forecasts, made in advance using whatever information is available at the time.

Ex-post forecasts are those that are made using later information on the predictors. For example, ex-post forecasts of consumption for each of the 2011 quarters may use the actual observations of income for each of these quarters, once these have been observed. These are not genuine forecasts, but they are useful for studying the behaviour of forecasting models.

The model from which ex-post forecasts are produced should not be estimated using data from the forecast period. That is, ex-post forecasts can assume knowledge of the predictor variable (the $x$ variable), but should not assume knowledge of the data that are to be forecast (the $y$ variable).
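A minimal illustration of this point, using base R and simulated data: the model is estimated without the hold-out period, and the ex-post forecasts then use the actual predictor values (but not the actual responses) from that period.

```r
set.seed(1)
x <- rnorm(44)                          # predictor, e.g. income growth
y <- 0.5 + 0.3*x + rnorm(44, sd=0.1)    # response, e.g. consumption growth
train <- 1:40                           # estimation period only
fit <- lm(y ~ x, subset=train)          # the model never sees the last 4 periods

# Ex-post forecasts: actual x values from the forecast period are used,
# but the corresponding y values play no role in estimation.
expost <- predict(fit, newdata=data.frame(x=x[41:44]))
```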

A comparative evaluation of ex ante forecasts and ex post forecasts can help to separate out the sources of forecast uncertainty. This will show whether forecast errors have arisen due to poor forecasts of the predictor or due to a poor forecasting model.

Example 4.4 Linear trend

A common feature of time series data is a trend. Using regression, we can model and forecast the trend in time series data by including $t=1,\ldots,T,$ as a predictor variable: $$y_t=\beta_0+\beta_1t+\varepsilon_t. $$ Figure 4.12 shows a time series plot of aggregate tourist arrivals to Australia over the period 1980 to 2010 with the fitted linear trend line $\hat{y}_t= 0.3375+0.1761t$. Also plotted are the point forecasts and forecast intervals for the years 2011 to 2015.

Figure 4.12: Forecasting international tourist arrivals to Australia for the period 2011-2015 using a linear trend. 80% and 95% forecast intervals are shown.

R code
fit.ex4 <- tslm(austa ~ trend)
f <- forecast(fit.ex4, h=5, level=c(80,95))
plot(f, ylab="International tourist arrivals to Australia (millions)",
  xlab="Year")
R output
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.337535   0.100366   3.363  0.00218 **
trend       0.176075   0.005475  32.157  < 2e-16 ***

Residual autocorrelation

With time series data it is highly likely that the value of a variable observed in the current time period will be influenced by its value in the previous period, or even the period before that, and so on. Therefore, when fitting a regression model to time series data, it is very common to find autocorrelation in the residuals. In this case, the estimated model violates the assumption of no autocorrelation in the errors, and our forecasts may be inefficient: there is some information left over which should be utilised in order to obtain better forecasts. The forecasts from a model with autocorrelated errors are still unbiased, and so are not “wrong”, but they will usually have larger prediction intervals than necessary.
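As an illustration of how this leftover information shows up, here is a small base-R sketch with simulated AR(1) errors; the lag-1 residual autocorrelation clearly flags the problem:

```r
set.seed(42)
# Simulate AR(1) errors: e_t = 0.7*e_{t-1} + noise
e <- as.numeric(filter(rnorm(100), 0.7, method="recursive"))
x <- rnorm(100)
y <- 1 + 2*x + e                     # regression with autocorrelated errors
res <- resid(lm(y ~ x))
r1 <- acf(res, plot=FALSE)$acf[2]    # lag-1 residual autocorrelation
```

Here `r1` will be well away from zero, whereas with uncorrelated errors it would typically lie inside the usual $\pm 2/\sqrt{T}$ bounds.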

Figure 4.13 plots the residuals from Examples 4.3 and 4.4, and the ACFs of the residuals (see Section 2/2 for an introduction to the ACF). The ACF of the consumption residuals shows a significant spike at lag 2, and the ACF of the tourism residuals shows significant spikes at lags 1 and 2. Usually, plotting the ACF of the residuals is adequate to reveal any potential autocorrelation. More formal tests for autocorrelation are discussed in Section 5/4.

Figure 4.13: Residuals from the regression models for Consumption and Tourism. Because these involved time series data, it is important to look at the ACF of the residuals to see if there is any remaining information not accounted for by the model. In both these examples, there is some remaining autocorrelation in the residuals.

R code
res3 <- ts(resid(fit.ex3), start=1970.25, frequency=4)
plot.ts(res3, ylab="res (Consumption)")
res4 <- resid(fit.ex4)
plot(res4, ylab="res (Tourism)")

Spurious regression

More often than not, time series data are “non-stationary”; that is, the values of the time series do not fluctuate around a constant mean or with a constant variance. We will deal with time series stationarity in more detail in Chapter 8, but here we need to address the effect non-stationary data can have on regression models.

For example, consider the two variables plotted in Figure 4.14, which appear to be related simply because they both trend upwards in the same manner. However, air passenger traffic in Australia has nothing to do with rice production in Guinea. Selected output obtained from regressing the number of air passengers transported in Australia against rice production in Guinea (in metric tons) is also shown in Figure 4.14.

Figure 4.14: Trending time series data can appear to be related, as shown in this example in which air passengers in Australia are regressed against rice production in Guinea.

R output
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -5.7297     0.9026  -6.348  1.9e-07 ***
guinearice   37.4101     1.0487  35.672  < 2e-16 ***

Multiple R-squared: 0.971.
 lag     Autocorrelation  
   1           0.7496971

Regressing non-stationary time series can lead to spurious regressions. High $R^2$s and high residual autocorrelation can be signs of spurious regression. We discuss the issues surrounding non-stationary data and spurious regressions in detail in Chapter 9.

Cases of spurious regression might appear to give reasonable short-term forecasts, but they will generally not continue to work into the future.
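The effect is easy to reproduce with simulated data: regressing one random walk on a completely independent one routinely produces a large $R^2$ and strongly autocorrelated residuals. A base-R sketch (the two series are hypothetical, not the passenger or rice data):

```r
set.seed(123)
x <- cumsum(rnorm(200))   # random walk 1 (non-stationary)
y <- cumsum(rnorm(200))   # random walk 2, independent of x
fit <- lm(y ~ x)
r2 <- summary(fit)$r.squared               # often spuriously large
r1 <- acf(resid(fit), plot=FALSE)$acf[2]   # residuals highly autocorrelated
```

Both symptoms mirror the air-passengers/rice example above: an impressive fit statistic alongside residuals that behave like a non-stationary series.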