5.2 Some useful predictors

Dummy variables

So far, we have assumed that each predictor takes numerical values. But what about when a predictor is a categorical variable taking only two values (e.g., "yes" and "no")? Such a variable might arise, for example, when forecasting credit scores and you want to take account of whether the customer is in full-time employment. So the predictor takes value "yes" when the customer is in full-time employment, and "no" otherwise.

This situation can still be handled within the framework of multiple regression models by creating a "dummy variable" taking value 1 corresponding to "yes" and 0 corresponding to "no". A dummy variable is also known as an "indicator variable".
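As a minimal sketch of this coding in base R, the snippet below turns a "yes"/"no" predictor into a 0/1 dummy variable. The data and the names `employment` and `full_time` are invented for illustration.

```r
# Hypothetical data: full-time employment status for five customers
employment <- c("yes", "no", "yes", "yes", "no")

# Dummy (indicator) variable: 1 corresponds to "yes", 0 to "no"
full_time <- as.integer(employment == "yes")
full_time
# 1 0 1 1 0
```

The resulting numeric vector can be used in a regression like any other predictor.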

If there are more than two categories, then the variable can be coded using several dummy variables (one fewer than the total number of categories).

Seasonal dummy variables

For example, suppose we are forecasting daily electricity demand and we want to account for the day of the week as a predictor. Then the following dummy variables can be created.

Day        D1  D2  D3  D4  D5  D6
Monday      1   0   0   0   0   0
Tuesday     0   1   0   0   0   0
Wednesday   0   0   1   0   0   0
Thursday    0   0   0   1   0   0
Friday      0   0   0   0   1   0
Saturday    0   0   0   0   0   1
Sunday      0   0   0   0   0   0
Monday      1   0   0   0   0   0
Tuesday     0   1   0   0   0   0
Wednesday   0   0   1   0   0   0
Thursday    0   0   0   1   0   0
$\vdots$ $\vdots$ $\vdots$ $\vdots$ $\vdots$ $\vdots$ $\vdots$

Notice that only six dummy variables are needed to code seven categories. That is because the seventh category (in this case Sunday) is specified when the dummy variables are all set to zero.
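The coding in the table can be reproduced with a few lines of base R. This is only a sketch: the two weeks of data are made up, and the names `days`, `day` and `D` are chosen for illustration.

```r
# Two illustrative weeks of daily data, starting on a Monday
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
day <- factor(rep(days, times = 2), levels = days)

# One dummy per day for Monday to Saturday; Sunday is the omitted category
D <- sapply(days[1:6], function(d) as.integer(day == d))
colnames(D) <- paste0("D", 1:6)
rownames(D) <- as.character(day)
D
```

Each Sunday row contains only zeros, matching the table above.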

Many beginners will try to add a seventh dummy variable for the seventh category. This is known as the "dummy variable trap" because it will cause the regression to fail: when the model includes an intercept, the seven dummy variables always sum to one, so they are perfectly collinear with the intercept and there are too many parameters to estimate uniquely. The general rule is to use one fewer dummy variables than categories. So for quarterly data, use three dummy variables; for monthly data, use 11 dummy variables; and for daily data, use six dummy variables.
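The trap can be seen numerically by checking the rank of the design matrix. The sketch below uses simulated data and made-up names (`D7`, `X_trap`, `X_ok`). Note that base R's `lm()` does not stop with an error in this situation. Instead it reports `NA` for the coefficient it cannot estimate.

```r
set.seed(1)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
day <- factor(rep(days, times = 8), levels = days)
y <- rnorm(length(day))  # simulated response, for illustration only

# All seven dummy variables, one column per day
D7 <- sapply(days, function(d) as.integer(day == d))

X_trap <- cbind(Intercept = 1, D7)         # intercept + 7 dummies: 8 columns
X_ok   <- cbind(Intercept = 1, D7[, 1:6])  # intercept + 6 dummies: 7 columns

qr(X_trap)$rank  # 7, fewer than 8 columns: one parameter is redundant
qr(X_ok)$rank    # 7, equal to the number of columns

fit_trap <- lm(y ~ D7)
coef(fit_trap)   # one of the dummy coefficients is NA
```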

The interpretation of each of the coefficients associated with the dummy variables is that it is a measure of the effect of that category relative to the omitted category. In the above example, the coefficient associated with Monday will measure the effect of Monday compared to Sunday on the forecast variable.

Outliers

If there is an outlier in the data, rather than omit it, you can use a dummy variable to remove its effect. In this case, the dummy variable takes value one for that observation and zero everywhere else.
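A small simulated example may help to show what "removing its effect" means. In the sketch below (all data and names are invented), observation 15 is contaminated and then given its own dummy variable. The fit then reproduces that observation exactly, and the other coefficients match those obtained when the observation is dropped.

```r
set.seed(42)
n <- 30
x <- 1:n
y <- 10 + 0.5 * x + rnorm(n)
y[15] <- y[15] + 20              # contaminate observation 15

outlier <- as.integer(x == 15)   # one for the outlier, zero everywhere else

fit_dummy <- lm(y ~ x + outlier)
fit_drop  <- lm(y[-15] ~ x[-15]) # same data with the outlier omitted

residuals(fit_dummy)[15]         # zero up to rounding error
coef(fit_dummy)[1:2]             # intercept and slope
coef(fit_drop)                   # same values as the line above
```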

Public holidays

For daily data, the effect of public holidays can be accounted for by including a dummy variable predictor taking value one on public holidays and zero elsewhere.

Easter

Easter is different from most holidays because it is not held on the same date each year and the effect can last for several days. In this case, a dummy variable can be used with value one where any part of the holiday falls in the particular time period and zero otherwise.

For example, with monthly data, the dummy variable takes value 1 in March when Easter falls in March, and value 1 in April when it falls in April. When Easter starts in March and finishes in April, the dummy variable takes value 1 for both months.
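The following base R sketch builds such a monthly Easter dummy for 2016 to 2018. It assumes, purely for illustration, that the holiday runs from Good Friday to Easter Monday. The Easter Sunday dates are entered by hand, and the variable names are made up.

```r
# Easter Sunday dates, entered by hand
easter_sunday <- as.Date(c("2016-03-27", "2017-04-16", "2018-04-01"))

# Assumption: the holiday covers Good Friday (-2 days) to Easter Monday (+1 day)
holiday_days <- rep(easter_sunday, each = 4) +
  rep(-2:1, times = length(easter_sunday))

# Monthly time index for 2016-2018
months <- seq(as.Date("2016-01-01"), as.Date("2018-12-01"), by = "month")
month_key <- format(months, "%Y-%m")

# Dummy is 1 if any part of the holiday falls in the month, 0 otherwise
easter <- as.integer(month_key %in% format(holiday_days, "%Y-%m"))
names(easter) <- month_key

names(easter)[easter == 1]
# "2016-03" "2017-04" "2018-03" "2018-04"
```

In 2018 the holiday starts on 30 March and finishes on 2 April, so both March and April are flagged.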

Trend

A linear trend is easily accounted for by including the predictor $x_{1,t}=t$.

A quadratic or higher order trend is obtained by specifying \[ x_{1,t} =t,\quad x_{2,t}=t^2,\quad \dots \] However, it is not recommended that quadratic or higher order trends are used in forecasting. When they are extrapolated, the resulting forecasts are often very unrealistic.

A better approach is to use a piecewise linear trend which bends at some time. If the trend bends at time $\tau$, then it can be specified by including the following predictors in the model.

\begin{align*} x_{1,t} & = t \\ x_{2,t} &= \left\{ \begin{array}{ll} 0 & t < \tau\\ (t-\tau) & t \ge \tau \end{array}\right. \end{align*}

If the associated coefficients of $x_{1,t}$ and $x_{2,t}$ are $\beta_1$ and $\beta_2$, then $\beta_1$ gives the slope of the trend before time $\tau$, while the slope of the line after time $\tau$ is given by $\beta_1+\beta_2$.
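To make the interpretation of $\beta_1$ and $\beta_2$ concrete, here is a sketch on simulated data. The bend point `tau`, the true slopes and all variable names are assumptions chosen for illustration.

```r
set.seed(123)
n <- 60
t <- 1:n
tau <- 30                    # assumed time of the bend

x1 <- t
x2 <- pmax(0, t - tau)       # 0 for t < tau, (t - tau) for t >= tau

# Simulated series: slope 2 before tau and 2 - 1.5 = 0.5 after tau
y <- 5 + 2 * x1 - 1.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
b <- coef(fit)

slope_before <- unname(b["x1"])            # estimate of beta_1
slope_after  <- unname(b["x1"] + b["x2"])  # estimate of beta_1 + beta_2
c(before = slope_before, after = slope_after)
```

The two estimates should be close to the true slopes of 2 and 0.5 used in the simulation.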

An extension of this idea is to use a spline (see Section 5/6).

Ex post and ex ante forecasting

As discussed in Section 4/8, ex ante forecasts are those that are made using only the information that is available in advance, while ex post forecasts are those that are made using later information on the predictors.

Normally, we cannot use future values of the predictor variables when producing ex ante forecasts because their values will not be known in advance. However, the special predictors introduced in this section are all known in advance, as they are based on calendar variables (e.g., seasonal dummy variables or public holiday indicators) or deterministic functions of time. In such cases, there is no difference between ex ante and ex post forecasts.

Example: Australian quarterly beer production

We can model the Australian beer production data using a regression model with a linear trend and quarterly dummy variables: \[ y_{t} = \beta_{0} + \beta_{1} t + \beta_{2}d_{2,t} + \beta_3 d_{3,t} + \beta_4 d_{4,t} + e_{t}, \] where $d_{i,t} = 1$ if $t$ is in quarter $i$ and 0 otherwise. The first quarter variable has been omitted, so the coefficients associated with the other quarters are measures of the difference between those quarters and the first quarter.

Computer output from this model is given below.

R code
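library(fpp)  # loads the forecast package and the ausbeer data
# Quarterly beer production from 1992 Q1 to 2005 Q4
beer2 <- window(ausbeer, start=1992, end=2006-.1)
# Regression on a linear trend and quarterly seasonal dummy variables
fit <- tslm(beer2 ~ trend + season)
summary(fit)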
R output
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 441.8141     4.5338  97.449  < 2e-16
trend        -0.3820     0.1078  -3.544 0.000854
season2     -34.0466     4.9174  -6.924 7.18e-09
season3     -18.0931     4.9209  -3.677 0.000568
season4      76.0746     4.9268  15.441  < 2e-16

Residual standard error: 13.01 on 51 degrees of freedom
Multiple R-squared: 0.921,  Adjusted R-squared: 0.9149

So there is an average downward trend of 0.382 megalitres per quarter. On average, the second quarter has production 34.0 megalitres lower than the first quarter, the third quarter has production 18.1 megalitres lower than the first quarter, and the fourth quarter has production 76.1 megalitres higher than the first quarter. The model explains 92.1% of the variation in the beer production data.
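As a worked check on how the coefficients combine, the sketch below computes point forecasts for the four quarters of 2006 by hand. It assumes the trend variable runs from 1 to 56 over the fitted sample (56 quarters, consistent with 51 residual degrees of freedom plus five coefficients), so that 2006 Q1 corresponds to $t=57$. The coefficients are copied, rounded, from the output above, so the results may differ slightly from those of a forecasting function.

```r
# Rounded coefficients copied from the regression output
b0 <- 441.8141   # intercept
b1 <- -0.3820    # trend

# Seasonal effects relative to the omitted first quarter
season_effect <- c(Q1 = 0, Q2 = -34.0466, Q3 = -18.0931, Q4 = 76.0746)

# Assumption: t = 1 is 1992 Q1, so 2006 Q1 to 2006 Q4 are t = 57, ..., 60
t_new <- 57:60

point_forecast <- b0 + b1 * t_new + season_effect
round(point_forecast, 1)
#    Q1    Q2    Q3    Q4
# 420.0 385.6 401.2 495.0
```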

The following plots show the actual values compared to the predicted values.

Figure 5.5: Time plot of beer production and predicted beer production.

Figure 5.6: Actual beer production plotted against predicted beer production.

Forecasts obtained from the model are shown in Figure 5.7.

Figure 5.7: Forecasts from the regression model for beer production. The dark blue region shows 80% prediction intervals and the light blue region shows 95% prediction intervals.

Intervention variables

It is often necessary to model interventions that may have affected the variable to be forecast. For example, competitor activity, advertising expenditure, industrial action, and so on, can all have an effect.

When the effect lasts only for one period, we use a spike variable. This is a dummy variable taking value one in the period of the intervention and zero elsewhere. A spike variable is equivalent to a dummy variable for handling an outlier.

Other interventions have an immediate and permanent effect. If an intervention causes a level shift (i.e., the value of the series changes suddenly and permanently from the time of intervention), then we use a step variable. A step variable takes value zero before the intervention and one from the time of intervention onwards.

Another form of permanent effect is a change of slope. Here the intervention is handled using a piecewise linear trend as discussed earlier (where $\tau$ is the time of intervention).
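The three kinds of intervention variable described above can be constructed as follows. This is a sketch for a series of length 12 with a hypothetical intervention at time 7. The names `t_int`, `spike`, `step` and `slope_change` are illustrative.

```r
n <- 12
t <- 1:n
t_int <- 7   # hypothetical time of the intervention

spike        <- as.integer(t == t_int)  # one-period effect
step         <- as.integer(t >= t_int)  # permanent level shift
slope_change <- pmax(0, t - t_int)      # permanent change of slope (tau = t_int)

cbind(t, spike, step, slope_change)
```

Any of these columns can be added to the regression alongside the other predictors.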

Trading days

The number of trading days in a month can vary considerably and can have a substantial effect on sales data. To allow for this, the number of trading days in each month can be included as a predictor. An alternative that allows for the effects of different days of the week has the following predictors.

\begin{align*} x_{1} &= \text{number of Mondays in month;} \\ x_{2} &= \text{number of Tuesdays in month;} \\ & \vdots \\ x_{7} &= \text{number of Sundays in month.} \end{align*}
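These seven predictors can be computed from the calendar alone. The helper below, `count_weekdays()`, is a hypothetical function written for this sketch using only base R. It counts every calendar day and makes no adjustment for public holidays.

```r
# Count how many times each day of the week occurs in a given month
count_weekdays <- function(year, month) {
  first <- as.Date(sprintf("%04d-%02d-01", year, month))
  # First day of the following month; the day before it ends this month
  next_first <- seq(first, by = "month", length.out = 2)[2]
  days <- seq(first, next_first - 1, by = "day")
  # as.POSIXlt()$wday is 0 for Sunday, 1 for Monday, ..., 6 for Saturday
  wday <- as.POSIXlt(days)$wday
  counts <- tabulate(ifelse(wday == 0, 7, wday), nbins = 7)
  names(counts) <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
  counts
}

# February 2024 has 29 days and starts on a Thursday
count_weekdays(2024, 2)
# Mon Tue Wed Thu Fri Sat Sun
#   4   4   4   5   4   4   4
```

Applying the function to every month in the sample gives the values of $x_1,\dots,x_7$ for each observation.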

Distributed lags

It is often useful to include advertising expenditure as a predictor. However, since the effect of advertising can last beyond the actual campaign, we need to include lagged values of advertising expenditure. So the following predictors may be used.

\begin{align*} x_{1} &= \text{advertising for previous month;} \\ x_{2} &= \text{advertising for two months previously;} \\ &\vdots \\ x_{m} &= \text{advertising for $m$ months previously.} \end{align*}
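A matrix of lagged predictors of this form can be built directly in base R. In the sketch below, the advertising figures, the maximum lag `m` and the column names are all invented for illustration.

```r
# Hypothetical monthly advertising expenditure
advert <- c(10, 12, 9, 15, 20, 18, 14, 11)
m <- 3   # assumed maximum lag

# Column k holds advertising from k months earlier (NA where not yet available)
lagged <- sapply(1:m, function(k) c(rep(NA, k), head(advert, -k)))
colnames(lagged) <- paste0("lag", 1:m)

cbind(advert, lagged)
```

The first $m$ rows contain missing values because the earlier advertising figures are not observed. By default, `lm()` omits rows with missing values when fitting.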

It is common to require the coefficients to decrease as the lag increases. In Chapter 9 we discuss methods to allow this constraint to be implemented.