3.5 Central limit theorem

As indicated above, the normal distribution is important in biological settings for a variety of reasons. However, there is one more reason that the normal distribution takes on a central role in data analysis, and that reason is the Central Limit Theorem (CLT). The basic tenets of the CLT are:

  • If $y \sim N(\mu , \sigma )$, then $\bar y \sim N(\mu ,\frac{\sigma }{\sqrt {n}})$, and

  • If $y$ has finite variance, and the $y_{i}$ are random and independent observations, then $\bar y$ approaches $N(\mu ,\frac{\sigma }{\sqrt {n}})$ as $n \rightarrow \infty $.

Although it may not immediately jump off of the page, the implications of the CLT are huge. To understand them, we first need to define a new measure of dispersion, the standard error of the mean (SEM). The SEM is a measure of dispersion, not for $y$, but for $\bar y$. Implicitly, the definition of the SEM assumes that you repeatedly sample from a population, grabbing $n$ items each time. For each sample of $n$ items, a mean is calculated. The standard deviation of these sample means, then, is called the standard error of the mean. (In this discussion we are focusing on the mean, but a standard error is the standard deviation of any statistic, assuming repeated sampling.) The parametric value of the SEM is represented by $\sigma_{\bar y}$, and the sample value is $s_{\bar y}$.
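Under the independence assumption, the usual formula for the SEM follows from two standard properties of variances: a constant factors out squared, and the variances of independent variables add. A brief sketch, writing $\sigma^{2}$ for the variance of $y$:

$$\sigma_{\bar y}^{2} = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} y_{i}\right) = \frac{1}{n^{2}}\sum_{i=1}^{n}\mathrm{Var}(y_{i}) = \frac{n\sigma^{2}}{n^{2}} = \frac{\sigma^{2}}{n}, \qquad \text{so} \qquad \sigma_{\bar y} = \frac{\sigma}{\sqrt{n}}.$$

In practice $\sigma$ is rarely known, so the sample value is computed as $s_{\bar y} = s/\sqrt{n}$, where $s$ is the sample standard deviation.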

The first part of the CLT states that, if you repeatedly sample a normal distribution, each time calculating a sample mean, the standard deviation of this new distribution of means is going to be $\sigma / \sqrt {n}$. Thus, to get the SEM, we do not literally have to go through the process of repeated sampling. It also states that the parametric mean of this new distribution of means will be the same as the parametric mean of $y$, namely $\mu $. Even more important, the second part of the CLT states that, regardless of the distribution of $y$ (provided its variance is finite), as long as the observations are random and independent, the distribution of $\bar y$ (created by repeatedly sampling $y$) will tend towards a normal distribution as the sample size (that is, $n$, the sample size for each $\bar y_{i}$) increases.
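The first claim is easy to check by brute force. The following is a minimal sketch, assuming hypothetical values $\mu = 10$, $\sigma = 2$ and $n = 25$ (none of these come from the text). It uses replicate() in place of an explicit loop, and the seed is arbitrary.

```r
set.seed(1)      # arbitrary seed, only so the run is repeatable
mu    <- 10      # hypothetical parametric mean
sigma <- 2       # hypothetical parametric standard deviation
n     <- 25      # hypothetical size of each sample

# Draw 10000 samples of size n from N(mu, sigma); keep each sample mean
ybar <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

sem_observed <- sd(ybar)          # standard deviation of the sample means
sem_expected <- sigma / sqrt(n)   # value given by the CLT (0.4 here)

c(observed = sem_observed, expected = sem_expected, mean_of_means = mean(ybar))
```

The observed and expected SEM should agree closely, and the mean of the sample means should sit near $\mu$, although the exact simulated values depend on the seed.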

Let’s demonstrate the basic tenets of the CLT with a simple exercise. In the following code, a very non-normal population will be created. In fact, the population will be created from a uniform distribution (see note 1). We will then sample this population 1000 times, each time calculating a mean. The distribution of these means will then be plotted. (Some sample results are shown in Fig. 3.7.)

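     >  y = runif(1000)              # population: 1000 values from a uniform distribution
     >  par(mfrow = c(2,1))          # stack two plots in one window
     >  hist(y)                      # (a) the distribution of y
     >  out = vector()               # will hold the sample means
     >  for (i in 1:1000){
     +   out_i = mean(sample(y, size = 3))   # mean of a random sample of 3 items
     +   out = c(out, out_i)                 # append it to the results
     +   }
     >  hist(out)                    # (b) the distribution of the sample means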

(a) The distribution of $y$.
(b) Distributions of $\bar y$, illustrating the effect of changing $n$.

Figure 3.7: Illustration of the Central Limit Theorem. Given an original distribution of $y$, which is non-normal, the distribution of $\bar y$ becomes increasingly normal as $n$ increases. Each of the distributions of $\bar y$ was created by repeatedly sampling the distribution of $y$ 1000 times.

At this point, we will not go into detail about every aspect of the code. However, it does illustrate the fact that, because R is an actual language (and not just an analysis program), we can write a for loop to do the job of repeatedly sampling $y$, which follows a uniform distribution. We used sample() as an argument to mean(). sample() told R to randomly grab a sample of 3 items from $y$.

Change the sample size to see how it influences the degree to which the $\bar y$ values (stored in out) appear to be normally distributed.
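One way to run this comparison in a single pass is sketched below. The sample sizes in sizes are arbitrary choices for illustration, and replicate() stands in for the explicit for loop used earlier. The sketch also records the standard deviation of the means at each $n$ so it can be set against $s/\sqrt{n}$.

```r
set.seed(42)                    # arbitrary seed, only so the run is repeatable
y <- runif(1000)                # a uniform (non-normal) population, as before
sizes <- c(1, 3, 10, 30)        # hypothetical sample sizes to compare

par(mfrow = c(2, 2))            # one histogram per sample size
sem_obs <- numeric(length(sizes))
for (j in seq_along(sizes)) {
  n <- sizes[j]
  # 1000 sample means, each from a random sample of n items
  out <- replicate(1000, mean(sample(y, size = n)))
  hist(out, main = paste("n =", n), xlab = "sample mean", xlim = c(0, 1))
  sem_obs[j] <- sd(out)         # observed standard error at this n
}

# Observed standard error next to the value expected from sd(y) / sqrt(n)
cbind(n = sizes, observed = sem_obs, expected = sd(y) / sqrt(sizes))
```

With a common x-axis, the histograms should narrow and look more bell-shaped as $n$ grows. Because sample() draws without replacement from a finite population, the observed values can fall slightly below the expected ones at the larger sample sizes.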

Simply stated, the implication of the CLT is that the distribution of means will tend towards normality as sample size increases. This, in turn, allows us to quantify expectations about mean values using the normal distribution without having to worry so much about the original distribution of our data.
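As a hedged illustration of what quantifying expectations can look like, suppose $y$ is uniform on 0 to 1 (so $\mu = 0.5$ and $\sigma = \sqrt{1/12}$), and we ask how often the mean of $n = 30$ observations exceeds 0.6. The threshold, the sample size, and the number of simulated means are arbitrary choices made for this sketch. The normal approximation is computed with pnorm() and compared with a direct simulation.

```r
set.seed(7)              # arbitrary seed, only so the run is repeatable
n     <- 30              # hypothetical sample size
mu    <- 0.5             # mean of a uniform(0, 1) variable
sigma <- sqrt(1 / 12)    # standard deviation of a uniform(0, 1) variable

# CLT-based answer: treat ybar as N(mu, sigma / sqrt(n))
p_norm <- pnorm(0.6, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)

# Simulation-based answer: proportion of 20000 sample means above 0.6
ybar  <- replicate(20000, mean(runif(n)))
p_sim <- mean(ybar > 0.6)

c(normal = p_norm, simulated = p_sim)
```

Both values should come out near 0.03, even though no individual $y$ is anywhere close to normally distributed.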

  1. The density, cumulative distribution, quantile, and random values for a uniform distribution can be calculated using dunif(x, ...), punif(q, ...), qunif(p, ...), and runif(n, ...), respectively. In the uniform distribution, all possible values are equally likely.