4.3 Regression and correlation

The correlation coefficient $r$ was introduced in Section 2/2. Recall that $r$ measures the strength and the direction (positive or negative) of the linear relationship between the two variables. The stronger the linear relationship, the closer the observed data points will cluster around a straight line.

The slope coefficient $\hat{\beta}_1$ can also be expressed as $$\hat{\beta}_1=r\frac{s_{y}}{s_x},$$ where $s_y$ is the standard deviation of the $y$ observations and $s_x$ is the standard deviation of the $x$ observations.
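The identity above can be checked numerically. The following is a minimal sketch on synthetic data (the variables `x` and `y` are simulated for illustration and are not the car emissions data); it compares the slope returned by `lm()` with $r\,s_y/s_x$ computed from `cor()` and `sd()`.

```r
# Synthetic data for illustration only (not the car emissions data)
set.seed(1)
x <- runif(100, min = 10, max = 40)
y <- 12.5 - 0.22 * x + rnorm(100, sd = 0.5)

fit <- lm(y ~ x)
slope_lm <- unname(coef(fit)["x"])     # least squares slope from lm()
slope_r  <- cor(x, y) * sd(y) / sd(x)  # slope computed as r * s_y / s_x

# The two values agree up to floating point error
all.equal(slope_lm, slope_r)
```

The two quantities agree because the sample covariance divided by the variance of $x$ (the least squares slope) is the same as $r\,s_y/s_x$.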

So correlation and regression are strongly linked. The advantage of a regression model over correlation is that it asserts a predictive relationship between the two variables ($x$ predicts $y$) and quantifies this in a way that is useful for forecasting.

Example 4.1 Car emissions

Data on the carbon footprint and fuel economy for 2009 model cars were first introduced in Chapter 1. A scatter plot of Carbon (carbon footprint in tonnes per year) versus City (fuel economy in city driving conditions in miles per gallon) for all 134 cars is presented in Figure 4.3. Also plotted is the estimated regression line $$\hat{y}=12.53-0.22x.$$

Figure 4.3: Fitted regression line from regressing the carbon footprint of cars on their fuel economy in city driving conditions.

R code
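# Assumes the fuel data frame (2009 model cars) is already loaded
# Jitter both variables so that overlapping points are visible
plot(jitter(Carbon) ~ jitter(City), xlab="City (mpg)",
  ylab="Carbon footprint (tonnes per year)", data=fuel)
# Regress carbon footprint on city fuel economy and add the fitted line
fit <- lm(Carbon ~ City, data=fuel)
abline(fit)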
R output
> summary(fit)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.525647   0.199232   62.87   <2e-16 ***
City        -0.220970   0.008878  -24.89   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4703 on 132 degrees of freedom
Multiple R-squared: 0.8244,     Adjusted R-squared: 0.823
F-statistic: 619.5 on 1 and 132 DF,  p-value: < 2.2e-16

The regression estimation output from R is also shown. Notice the coefficient estimates in the column labelled “Estimate”. The other features of the output will be explained later in this chapter.
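One link between this output and the correlation coefficient can be read off directly. In simple linear regression (a single predictor), the reported Multiple R-squared equals $r^2$. As a rough check using the rounded figures in the output, $$|r|=\sqrt{0.8244}\approx 0.908,$$ and since the slope estimate is negative, $r\approx -0.908$. Substituting into $\hat{\beta}_1=r\,s_y/s_x$ then gives the implied ratio of standard deviations, $$\frac{s_y}{s_x}=\frac{\hat{\beta}_1}{r}\approx\frac{-0.221}{-0.908}\approx 0.243.$$ These values are derived from the printed output only, not recomputed from the data.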

Interpreting the intercept, $\hat{\beta}_0=12.53$. A car that has fuel economy of $0$ mpg in city driving conditions can expect an average carbon footprint of $12.53$ tonnes per year. As often happens with the intercept, this is a case where the interpretation is nonsense as it is impossible for a car to have fuel economy of $0$ mpg.

The interpretation of the intercept requires that a value of $x=0$ makes sense. When $x=0$ makes sense, the intercept $\hat{\beta}_0$ is the predicted value of $y$ corresponding to $x=0$. Even when $x=0$ does not make sense, the intercept is an important part of the model. Without it, the slope coefficient can be distorted unnecessarily.
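The effect of dropping the intercept can be illustrated with a small sketch. It uses synthetic data whose true intercept is far from zero (an assumption chosen to resemble the example, not the actual car data) and compares the default fit with one forced through the origin using the formula `y ~ 0 + x`.

```r
# Synthetic data for illustration only: true intercept 12.5, true slope -0.22
set.seed(2)
x <- runif(100, min = 10, max = 40)
y <- 12.5 - 0.22 * x + rnorm(100, sd = 0.5)

fit_with    <- lm(y ~ x)      # intercept included (the default)
fit_without <- lm(y ~ 0 + x)  # line forced through the origin

coef(fit_with)     # slope close to the true value of -0.22
coef(fit_without)  # slope is positive: the sign of the relationship is lost
```

Because all the $x$ and $y$ values here are positive, a line through the origin must slope upwards, even though $y$ decreases as $x$ increases.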

Interpreting the slope, $\hat{\beta}_1=-0.22$. For every extra mile per gallon, a car’s carbon footprint will decrease on average by 0.22 tonnes per year. Alternatively, if we consider two cars whose fuel economies differ by 1 mpg in city driving conditions, their carbon footprints will differ, on average, by 0.22 tonnes per year (with the car travelling further per gallon of fuel having the smaller carbon footprint).
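Both interpretations can be reproduced by evaluating the fitted line directly. The sketch below copies the coefficient estimates from the R output in this example; the helper `predict_carbon` is a hypothetical name introduced only for illustration.

```r
# Coefficient estimates copied from the R output in this example
b0 <- 12.525647  # intercept
b1 <- -0.220970  # slope

# Hypothetical helper: predicted carbon footprint (tonnes per year)
# for a given city fuel economy (mpg)
predict_carbon <- function(city_mpg) b0 + b1 * city_mpg

predict_carbon(0)                        # equals the intercept
predict_carbon(21) - predict_carbon(20)  # equals the slope
predict_carbon(30)                       # prediction for a 30 mpg car
```

If the fitted model object is available, `predict(fit, newdata = data.frame(City = 30))` should give the same prediction without copying coefficients by hand.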