3.4 Normal

Both the binomial and the Poisson distributions dealt with discrete data. The normal distribution, on the other hand, deals with continuous data. It is by far the most used, and arguably the most important, theoretical probability distribution in the biological sciences. In fact, many of the statistical approaches we will develop later on assume some aspect of the data being analyzed is normally distributed. So, why is this distribution so important? Well, many (if not most) biological variables have characteristics that tend to lead to normal distributions. For example, biological variables tend to have multiple causal factors, and those causal factors tend to be independent in occurrence. The factors also tend to be independent in effect, and they tend to contribute equally to variance. Each of these characteristics can be cast in terms of mathematical properties, and it can be shown mathematically that they will lead to variables that have a normal distribution.^1

Figure 3.3: Barplots of binomial distributions with increasing $N$; $p$ is 0.5 for each. Imagine more and more bars being packed into the distribution until, eventually, the bars theoretically have a width of 0.

The shape of the normal distribution is the familiar bell-shaped curve. Interestingly, it was initially developed by de Moivre in 1733 as an approximation of the binomial PMF (Stigler 1999). Imagine a series of binomial PMFs, each with $p$ = 0.5, but with increasing $N$ (Fig. 3.3).^2 As $N$ increases, the calculation of formula \eqref{binomialpmf} becomes unwieldy. As an alternative, the height of each bar can be approximated using the normal PDF

\begin{equation} f(y) = \frac{1}{\sigma \sqrt {2\pi }}e^{\frac{-(y-\mu )^{2}}{2\sigma ^{2}}}, \label{normpdf} \tag{3.3} \end{equation}

which contains two parameters, the mean ($\mu $) and the standard deviation ($\sigma $). If you take this mental exercise to the extreme, theoretically, as you try to pack an infinite number of bars onto the plot, each individual bar has no width. Instead, you end up with a nice, continuous curve. However, because it is a continuous distribution (and there are an infinite number of bars), the result of the equation is no longer a relative frequency (i.e., a probability). There are so many values (i.e., bars) that the probability of any one value is essentially zero. Thus, the normal PDF does not give probabilities. To state this another way, if I randomly drew one observation from a normal distribution, the probability that it is exactly equal to any one value, say 1.56, is essentially zero. It might be 1.559999 or 1.5600001, but it will not be exactly 1.56000… (zeros go on forever). Hence the distinction between the PMF of discrete variables and the PDF of continuous variables.
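We can see this approximation at work by comparing exact binomial probabilities to the corresponding normal densities. This is a quick sketch, not code from the text; it uses the standard binomial results $\mu = Np$ and $\sigma = \sqrt{Np(1-p)}$.

    # Normal approximation to binomial bar heights for N = 50, p = 0.5
    N <- 50
    p <- 0.5
    y <- 20:30                                    # values near the center
    exact <- dbinom(y, size = N, prob = p)        # binomial probabilities
    approx <- dnorm(y, mean = N * p, sd = sqrt(N * p * (1 - p)))
    round(cbind(y, exact, approx), 4)             # the two agree closely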

The density, cumulative distribution, quantile, and random normal values can be calculated using dnorm(x, ...), pnorm(q, ...), qnorm(p, ...), and rnorm(n, ...), respectively. We can illustrate the normal distribution using dnorm(). In the absence of other information, dnorm() will use the default values of $\mu = 0$ and $\sigma = 1$, which define the “standard normal distribution”.

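The curve in Fig. 3.4 can be drawn with something like the following sketch; the margin settings, x-axis range, and line width here are assumptions, not the original values.

    # Plot the standard normal PDF (Fig. 3.4); specific values assumed
    par(mar = c(4, 4, 1, 1))                  # shrink the plot margins
    y <- seq(-4, 4, length.out = 200)         # a grid of y values
    plot(y, dnorm(y), type = "l", bty = "l",  # line plot, L-shaped box
         lwd = 2, xlab = "y", ylab = "f(y)")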
Here, we used type = "l" to change the plot type. By default, plot(), if given two vectors, will create a scatterplot; this argument instead creates a line plot. We also used the argument bty to change the way plot() puts a box around the plot and lwd to change the line width. par() was used to change the plot margins.

Figure 3.4: The normal distribution with $\mu = 0$ & $\sigma = 1$, referred to as the “standard normal distribution”. Notice that the y-axis here is not $P(y)$, as in discrete PMFs. Instead, we use $f(y)$ to represent the normal density. The x-axis on the normal PDF theoretically extends to $\pm \infty $, although the values for equation \eqref{normpdf} get extremely small. In fact, 95.5% of the data will fall within $\mu \pm 2\sigma $ in any normal distribution. In the standard normal distribution, this means that 95.5% of the data fall between -2 and 2.

Given a $\mu $ and $\sigma $, the normal distribution gives us an expectation about how often we should observe certain values of $y$ (assuming random sampling). However, because $P(y=x)=0$ for any x, we are instead interested in the probability that $y$ falls within a certain range of possible values. This concept is illustrated as areas under the normal curve. pnorm(x) reports $P(y \leq x)$, and qnorm(p) returns the value of x that satisfies $P(y \leq x) = p$. We can use qnorm() to find the range within which we expect to see certain values of $y$. For example, what is the range within which we expect 95% of all values of $y$ to occur? Well, if 95% of the values should fall within that range, then 5% should fall outside it, with 2.5% on each side (i.e., within each tail of the distribution, Fig. 3.5). These values can be estimated using the following.

     >  qnorm(0.025)

    [1] -1.959964

     >  qnorm(0.975)

    [1] 1.959964

Thus, 95% of the values within a standard normal distribution will occur between -1.96 and 1.96. Stated another way,

$$ 0.95 = P\left( -1.96 \leq y \leq 1.96 \right). $$
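We can verify this probability directly with pnorm(), which returns the area under the curve to the left of a given value:

     >  pnorm(1.96) - pnorm(-1.96)

    [1] 0.9500042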

Figure 3.5: The standard normal distribution ($\mu = 0$ & $\sigma = 1$), with the $P(y \leq -1.96)$ and $P(y \geq 1.96)$ illustrated. 95% of the values of the standard normal distribution fall within these limits. In PDFs, probabilities are represented as areas under the curve.

(a) The effect of changing $\mu $
(b) The effect of changing $\sigma $

Figure 3.6: Changing the parameters of the normal PDF. Because changing $\mu $ moves the distribution along the x-axis, it is sometimes referred to as the location parameter. Because changing $\sigma $ changes the dispersion within the distribution (its width), it is sometimes referred to as the scale parameter.

Keep in mind that there are an infinite number of normal distributions, defined by combinations of $\mu $ and $\sigma $ (see Fig. 3.6). As a result, you will sometimes see the notation $y \sim N(\mu , \sigma )$ to indicate that $y$ is a normally distributed variable with mean $\mu $ and standard deviation $\sigma $. However, the specific normal distribution in Fig. 3.4, referred to as the standard normal distribution, is frequently used as a basis for comparison. Any variable that is normally distributed can be converted to this standard normal distribution by applying

\begin{equation} z_{i} = \frac{y_{i}-\mu}{\sigma } \end{equation}

to every value of the variable. This formula is referred to as a z-transformation, and it can be accomplished using scale(). The numerator, subtracting $\mu $, centers the data (i.e., forces $\bar z = 0$), and dividing by $\sigma $ scales the data (i.e., forces the standard deviation of $z$ to equal 1). Take a look at ?scale. In some analyses, we do not want the fact that variables are measured on very different scales to influence our results. The z-transformation can be useful in these circumstances.
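As a quick sketch (the data here are simulated, not from the text), we can confirm that the transformation done by hand matches scale(); note that scale() uses the sample mean and standard deviation in place of $\mu $ and $\sigma $.

    # z-transform simulated data by hand and with scale()
    set.seed(42)                        # for reproducibility
    y <- rnorm(100, mean = 9, sd = 5)   # made-up example data
    z <- (y - mean(y)) / sd(y)          # center, then scale
    all.equal(z, as.numeric(scale(y)))  # TRUE: scale() gives the same values
    mean(z)                             # essentially 0 after centering
    sd(z)                               # exactly 1 after scaling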

Previously, when looking at the standard normal, we found that 95% of the values were expected to occur between -1.96 and 1.96. With the z-transformation, we can find this range for any normal distribution, as long as we know $\mu $ and $\sigma $. We want

$$ 0.95 = P\left( -1.96 \leq \frac{y - \mu }{\sigma } \leq 1.96 \right). $$

With some rearranging to isolate $y$, we have

$$ 0.95 = P\left(\mu -1.96\sigma \leq y \leq \mu +1.96\sigma \right). $$

As a simple example, if we know $y \sim N(9,5)$, what are the limits within which we expect 95% of the values of $y$ to occur?

     >  9 + qnorm(0.025) * 5

    [1] -0.79982

     >  9 - qnorm(0.025) * 5

    [1] 18.79982

or, more directly,

     >  qnorm(0.025, 9, 5)

    [1] -0.79982

     >  qnorm(0.975, 9, 5)

    [1] 18.79982
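As a check, pnorm() with the same mean and standard deviation confirms that 95% of the distribution falls between these limits:

     >  pnorm(18.79982, 9, 5) - pnorm(-0.79982, 9, 5)

    [1] 0.95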

  1. Readers are referred to [ ] for a more thorough discussion of the normal distribution in a biological context.

  2.  par(mfrow = c(1, 4))
      probs = dbinom(0:5, size = 5, prob = 0.5)
      barplot(probs, names.arg = 0:5, main = "N = 5")
      probs = dbinom(0:10, size = 10, prob = 0.5)
      barplot(probs, names.arg = 0:10, main = "N = 10")
      probs = dbinom(0:25, size = 25, prob = 0.5)
      barplot(probs, names.arg = 0:25, main = "N = 25")
      probs = dbinom(0:50, size = 50, prob = 0.5)
      barplot(probs, names.arg = 0:50, main = "N = 50")