# 5.7 Correlation, causation and forecasting

## Correlation is not causation

It is important not to confuse correlation with causation, or causation
with forecasting. A variable $x$ may be useful for predicting a
variable $y$, but that does not mean $x$ is causing $y$. It is
possible that $x$ *is* causing $y$, but it may be that the
relationship between them is more complicated than simple causality.

For example, it is possible to model the number of drownings at a beach
resort each month using the number of ice-creams sold in the same period.
The model can give reasonable forecasts, not because ice-creams cause
drownings, but because people eat more ice-creams on hot days when they
are also more likely to go swimming. So the two variables (ice-cream
sales and drownings) are correlated, but one is not causing the other.
It is important to understand that **correlations are useful for
forecasting, even when there is no causal relationship between the two
variables**.

However, often a better model is possible if a causal
mechanism can be determined. In this example, both ice-cream sales and
drownings will be affected by the temperature and by the numbers of
people visiting the beach resort. Again, high temperatures do not
actually *cause* people to drown, but they are more directly related to
why people are swimming. So a better model for drownings will probably
include temperatures and visitor numbers and exclude ice-cream sales.
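
This common-cause structure can be sketched in a small simulation. All of the numbers below (sample size, coefficients, noise levels) are invented for illustration; the point is only that two series driven by a shared cause come out correlated even though neither affects the other:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200  # simulated days (an arbitrary choice)

# Hypothetical common cause: daily temperature in degrees Celsius.
temperature = rng.normal(25, 5, size=n)

# Both series respond to temperature, not to each other
# (all coefficients and noise levels are invented).
ice_creams = 30 + 10 * temperature + rng.normal(0, 20, size=n)
drownings = 0.2 * temperature + rng.normal(0, 1, size=n)

r = np.corrcoef(ice_creams, drownings)[0, 1]
print(f"correlation(ice-cream sales, drownings) = {r:.2f}")
```

With no direct link between the two series, the correlation is still strong because temperature drives both, which is exactly why ice-cream sales would remain a usable (if indirect) predictor of drownings.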

## Confounded predictors

A related issue involves confounding variables. Suppose we are
forecasting monthly sales of a company for 2012, using data from
2000–2011. In January 2008 a new competitor came into the market and
started taking some market share. At the same time, the economy began to
decline. In your forecasting model, you include both competitor activity
(measured using advertising time on a local television station) and the
health of the economy (measured using GDP). It will not be possible to
separate the effects of these two predictors because they are
correlated. We say two variables are **confounded** when their effects
on the forecast variable cannot be separated. Any pair of correlated
predictors will have some level of confounding, but we would not normally
describe them as confounded unless there was a relatively high level of
correlation between them.

Confounding is not really a problem for forecasting, as we can still compute forecasts without needing to separate out the effects of the predictors. However, it becomes a problem with scenario forecasting as the scenarios should take account of the relationships between predictors. It is also a problem if some historical analysis of the contributions of various predictors is required.

## Multicollinearity and forecasting

A closely related issue is **multicollinearity**, which occurs when
similar information is provided by two or more of the predictor
variables in a multiple regression. It can occur in a number of ways.

- Two predictors are highly correlated with each other (that is, they have a correlation coefficient close to +1 or -1). In this case, knowing the value of one of the variables tells you a lot about the value of the other variable. Hence, they are providing similar information.
- A linear combination of predictors is highly correlated with another linear combination of predictors. In this case, knowing the value of the first group of predictors tells you a lot about the value of the second group of predictors. Hence, they are providing similar information.

The dummy variable trap is a special case of multicollinearity. Suppose you have quarterly data and use four dummy variables, $D_1,D_2,D_3$ and $D_4$. Then $D_4=1-D_1-D_2-D_3$, so there is perfect correlation between $D_4$ and $D_1+D_2+D_3$.
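
A quick way to see the trap numerically is to check the rank of the design matrix. A sketch with three years of quarterly data (the data layout is hypothetical):

```python
import numpy as np

# Three years of quarterly data, with all four dummies D1..D4.
quarters = np.tile(np.arange(4), 3)    # Q1, Q2, Q3, Q4 repeated
D = np.eye(4)[quarters]                # one dummy column per quarter
X = np.column_stack([np.ones(12), D])  # intercept + D1 + D2 + D3 + D4

# Since D4 = 1 - D1 - D2 - D3, the columns are linearly dependent:
print(X.shape[1], "columns, rank", np.linalg.matrix_rank(X))  # 5 columns, rank 4
```

Dropping any one of the five columns restores full rank, which is the standard fix for the trap.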

When multicollinearity occurs in a multiple regression model, there are several consequences that you need to be aware of.

- If there is perfect correlation (i.e., a correlation of +1 or -1, such as in the dummy variable trap), it is not possible to estimate the regression model.
- If there is high correlation (close to but not equal to +1 or -1), then the estimation of the regression coefficients is computationally difficult. In fact, some software (notably Microsoft Excel) may give highly inaccurate estimates of the coefficients. Most reputable statistical software will use algorithms to limit the effect of multicollinearity on the coefficient estimates, but you do need to be careful. The major software packages such as R, SPSS, SAS and Stata all use estimation algorithms to avoid the problem as much as possible.
- The uncertainty associated with individual regression coefficients will be large. This is because they are difficult to estimate. Consequently, statistical tests (e.g., t-tests) on regression coefficients are unreliable. (In forecasting we are rarely interested in such tests and they have not been discussed in this book.) Also, it will not be possible to make accurate statements about the contribution of each separate predictor to the forecast.
- Forecasts will be unreliable if the values of the future predictors are outside the range of the historical values of the predictors. For example, suppose you have fitted a regression model with predictors $X$ and $Z$ which are highly correlated with each other, and suppose that the values of $X$ in the fitting data ranged between 0 and 100. Then forecasts based on $X>100$ or $X<0$ will be unreliable. It is always a little dangerous when future values of the predictors lie much outside the historical range, but it is especially problematic when multicollinearity is present.
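
The consequences above can be sketched with two near-identical predictors (all numbers are invented). The usual OLS standard-error formula gives huge uncertainty for each individual coefficient, while their sum, which is what the forecasts effectively depend on, is estimated precisely:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100

# Two predictors carrying almost the same information.
x = rng.normal(0, 1, size=n)
z = x + rng.normal(0, 0.01, size=n)  # z is nearly a copy of x
y = 2 * x + 3 * z + rng.normal(0, 1, size=n)

X = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Usual OLS standard errors: sqrt of sigma^2 * diag((X'X)^-1).
resid = y - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

print("estimated coefficients:", coef[1:])   # the split between x and z is unstable
print("standard errors:", se[1:])            # very large for both x and z
print("sum of effects:", coef[1] + coef[2])  # close to 5, precisely estimated
```

Because $x$ and $z$ move together in the data, forecasts only need their combined effect; it is any attempt to interpret the coefficients separately, or to forecast at combinations of $x$ and $z$ never seen historically, that runs into trouble.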

Note that if you are using good statistical software, if you are not interested in the specific contributions of each predictor, and if the future values of your predictor variables are within their historical ranges, there is nothing to worry about — multicollinearity is not a problem.