3.4 Normal
Both the binomial and the Poisson distributions dealt with discrete data. The normal distribution, on the other hand, deals with continuous data. It is by far the most widely used, and arguably the most important, theoretical probability distribution in the biological sciences. In fact, many of the statistical approaches we will develop later on assume that some aspect of the data being analyzed is normally distributed. So, why does this distribution seem so important? Many (if not most) biological variables have characteristics that tend to lead to normal distributions. For example, biological variables tend to have multiple causal factors, and those causal factors tend to be independent in occurrence. The factors also tend to be independent in effect, and they tend to contribute equally to variance. Each of these characteristics can be cast in terms of mathematical properties, and it can be shown mathematically that together they lead to variables that have a normal distribution^{1}.
The shape of the normal distribution is the familiar bell-shaped curve. Interestingly, it was initially developed by de Moivre in 1733 as an approximation of the binomial PMF (Stigler 1999). Imagine a series of binomial PMFs, each with $p = 0.5$, but with increasing $N$ (Fig. 3.3^{2}). As $N$ increases, the calculation of formula \eqref{binomialpmf} becomes unwieldy. As an alternative, the height of each bar can be approximated using the normal PDF,
\[ f(y) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(y - \mu)^2}{2\sigma^2}}, \]
which contains two parameters, the mean ($\mu $) and the standard deviation ($\sigma $). If you take this mental exercise to the extreme, theoretically, as you try to pack an infinite number of bars onto the plot, each individual bar has no width. Instead, you end up with a nice, continuous curve. However, because it is a continuous distribution (and there are an infinite number of bars), the result of the equation is no longer a relative frequency (i.e., a probability). There are so many values (i.e., bars) that the probability of any one value is essentially zero. Thus, the normal PDF does not give probabilities. To state this another way, if I randomly drew one observation from a normal distribution, the probability that it is exactly equal to any one value, say 1.56, is essentially zero. It might be 1.559999 or 1.5600001, but it will not be exactly 1.56000…(zeros go on forever). Hence, the distinction between the PMF of discrete variables and the PDF of continuous variables.
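As a quick check, the normal PDF formula agrees with dnorm() in R (the particular values of y, mu, and sigma below are arbitrary choices for illustration):

```r
mu <- 2; sigma <- 1.5; y <- 2.75                # arbitrary example values
# Density computed directly from the normal PDF formula
manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-(y - mu)^2 / (2 * sigma^2))
# Same value from R's built-in density function
builtin <- dnorm(y, mean = mu, sd = sigma)
```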
The density, cumulative distribution, quantile, and random normal values can be calculated using dnorm(x, ...), pnorm(q, ...), qnorm(p, ...), and rnorm(n, ...), respectively. We can illustrate the normal distribution using dnorm(). In the absence of other information, dnorm() will use the default values of $\mu = 0$ and $\sigma = 1$, which define the “standard normal distribution”.
Here, we used type = 'l' to change the plot type. By default, plot(), if given two vectors, will create a scatterplot. This argument instead created a line plot. We also used the argument bty to change the way plot() puts a box around the plot and lwd to change the line width. par() was used to change the plot margins.
Given a $\mu $ and $\sigma $, the normal distribution gives us an expectation about how often we should observe certain values of $y$ (assuming random sampling). However, because $P(y = x) = 0$ for any $x$, we are instead interested in determining the probability that $y$ falls within a certain range of possible values. This concept is illustrated as areas under the normal curve. pnorm(x) reports $P(y \leq x)$, and qnorm(p) returns the value of $x$ that satisfies $P(y \leq x) = p$. We can use qnorm() to find the range within which we expect to see certain values of $y$. For example, what is the range within which we expect 95% of all values of $y$ to occur? Well, if 95% of the values should be within this range, then 5% should fall outside it, with 2.5% in each tail of the distribution (Fig. 3.5). These values can be estimated using the following.
> qnorm(0.025)
[1] -1.959964
> qnorm(0.975)
[1] 1.959964
Thus, 95% of the values within a standard normal distribution will occur between $-1.96$ and $1.96$. Stated another way,
\[ 0.95 = P\left( -1.96 \leq y \leq 1.96 \right). \]
Figure 3.6: Changing the parameters of the normal PDF. Because changing $\mu $ moves the distribution along the x-axis, it is sometimes referred to as the location parameter. Because changing $\sigma $ changes the dispersion within the distribution (its width), it is sometimes referred to as the scale parameter.
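Curves like those in Fig. 3.6 can be sketched by overlaying densities with different parameters (the specific values of $\mu $ and $\sigma $ below are arbitrary choices, not necessarily those used in the figure):

```r
x <- seq(-6, 10, by = 0.01)
# Standard normal as the reference curve
plot(x, dnorm(x, mean = 0, sd = 1), type = "l", lwd = 2,
     bty = "l", xlab = "y", ylab = "Density")
lines(x, dnorm(x, mean = 3, sd = 1), lwd = 2, lty = 2)  # shifted location
lines(x, dnorm(x, mean = 0, sd = 2), lwd = 2, lty = 3)  # greater scale
```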
Keep in mind that there are an infinite number of normal distributions, based on combinations of $\mu $ and $\sigma $ (see Fig. 3.6). As a result, you will sometimes see the notation $y \sim N(\mu , \sigma )$ to indicate that $y$ is a normally distributed variable with mean $\mu $ and standard deviation $\sigma $. However, the specific normal distribution in Fig. 3.4, referred to as the standard normal distribution, frequently is used as a basis for comparison. Any variable that is normally distributed can be converted to this standard normal distribution by applying
\[ z = \frac{y - \mu}{\sigma} \]
to every value of the variable. This formula is referred to as a z-transformation, and it can be accomplished using scale(). The top part of the formula, subtracting $\mu $, centers the data (i.e., forces $\bar z = 0$), and dividing by $\sigma $ scales the data (i.e., forces the standard deviation of $z$ to equal 1). Take a look at ?scale. In some analyses, we do not want the fact that variables are measured on very different scales to influence our results. The z-transformation can be useful in these circumstances.
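For example, applying the formula by hand gives the same result as scale() (the sample values here are made up for illustration, and the sample mean and standard deviation stand in for $\mu $ and $\sigma $):

```r
y  <- c(4, 7, 9, 11, 14)            # made-up sample values
z1 <- (y - mean(y)) / sd(y)         # z-transformation by hand
z2 <- as.numeric(scale(y))          # same transformation via scale()
```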
Previously, when looking at the standard normal, we found that 95% of the values were expected to occur between $-1.96$ and $1.96$. With the z-transformation, we can find this range for any normal distribution, as long as we know $\mu $ and $\sigma $. We want
\[ 0.95 = P\left( -1.96 \leq \frac{y - \mu}{\sigma} \leq 1.96 \right). \]
With some rearranging to isolate $y$, we have
\[ 0.95 = P\left( \mu - 1.96\,\sigma \leq y \leq \mu + 1.96\,\sigma \right). \]
As a simple example, if we know $y \sim N(9,5)$, what are the limits within which we expect 95% of the values of $y$ to occur?
> 9 + qnorm(0.025) * 5
[1] -0.79982
> 9 - qnorm(0.025) * 5
[1] 18.79982
or, more directly,
> qnorm(0.025, 9, 5)
[1] -0.79982
> qnorm(0.975, 9, 5)
[1] 18.79982
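We can confirm that these two limits do enclose 95% of the $N(9, 5)$ distribution by taking the difference in areas reported by pnorm():

```r
# Limits enclosing the middle 95% of a N(9, 5) distribution
lo <- qnorm(0.025, mean = 9, sd = 5)
hi <- qnorm(0.975, mean = 9, sd = 5)
# Area under the curve between the two limits
area <- pnorm(hi, mean = 9, sd = 5) - pnorm(lo, mean = 9, sd = 5)
area  # 0.95
```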

^{1} Readers are referred to [ ] for a more thorough discussion of the normal distribution in a biological context.

^{2} The code used to produce Fig. 3.3:
par(mfrow = c(1, 4))  # four panels in one row
probs = dbinom(0:5, size = 5, prob = 0.5)
barplot(probs, names.arg = 0:5, main = 'N = 5')
probs = dbinom(0:10, size = 10, prob = 0.5)
barplot(probs, names.arg = 0:10, main = 'N = 10')
probs = dbinom(0:25, size = 25, prob = 0.5)
barplot(probs, names.arg = 0:25, main = 'N = 25')
probs = dbinom(0:50, size = 50, prob = 0.5)
barplot(probs, names.arg = 0:50, main = 'N = 50')