11.5 Overdispersion

As has been alluded to in previous sections, each of the error distributions makes an assumption about the relationship between the mean and variance (Fig. 1.9; see footnote 1). In the general linear model with normal residuals, the assumption was that the variance remained constant (i.e., we assumed homoscedasticity; see Fig. 1.9a). However, in the Binomial and Poisson distributions, this is not the case: the variance changes with the mean.

[Figure 1.9, panels: (a) Normal, (b) Binomial, (c) Poisson]

Figure 1.9: Theoretical expectations of the relationship between the variance and the mean in the Binomial and Poisson distributions. Axis labels also include the theoretical mean and variance, represented in terms of distribution parameters.

For the Binomial, we expect a hump-shaped relationship. In the Poisson, where the variance is equal to the mean, there is a 1:1 linear relationship. In fact, this is one of the reasons we would expect to see a pattern such as that observed in the first diagnostic plot in Fig. 1.7: as the expected value increases, variability increases.
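These two shapes can be reproduced numerically. The following is a minimal sketch, assuming the usual parameterization in which the Binomial has mean $N\rho$ and variance $N\rho(1-\rho)$ and the Poisson has mean and variance $\lambda$; the particular values of `N`, `rho`, and `lambda` are illustrative choices, not taken from the data.

```r
# Theoretical mean-variance relationships (parameter values are illustrative)
N   <- 20                           # number of binomial trials (assumed)
rho <- seq(0.05, 0.95, by = 0.05)   # binomial success probabilities
binom_mean <- N * rho
binom_var  <- N * rho * (1 - rho)   # hump-shaped: largest at rho = 0.5

lambda    <- 1:20                   # Poisson means
pois_mean <- lambda
pois_var  <- lambda                 # variance equals the mean (1:1)

# Simulation check for a single Poisson mean
set.seed(1)
y <- rpois(1e5, lambda = 4)
c(mean = mean(y), var = var(y))     # both should be close to 4

# Plot the two theoretical relationships side by side
op <- par(mfrow = c(1, 2))
plot(binom_mean, binom_var, type = "b",
     xlab = "Mean", ylab = "Variance", main = "Binomial")
plot(pois_mean, pois_var, type = "b",
     xlab = "Mean", ylab = "Variance", main = "Poisson")
abline(0, 1, lty = 2)               # the 1:1 line
par(op)
```

Data whose sample variance sits well above the dashed 1:1 line would be a candidate for the overdispersion discussed next.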

However, in some cases, the observed variability in data may exceed these expectations. For example, in the case of loglinear regression, it is not uncommon for the variance to increase much faster than the 1:1 relationship expected under the Poisson distribution. In fact, it is common for biological data to exhibit this phenomenon. In these cases, we say that the data are “overdispersed”. Unfortunately, an implication of such overdispersion is that the residual deviance tends to be inflated, thus masking the potential effects of predictor variables.

Well, how do we identify overdispersion? One approach is to compare the residual deviance to the residual df; in theory, the two should be approximately equal. However, as discussed in [ and [, this must be done with caution. Overdispersion may indicate that important predictor variables have not been considered, and variability in these factors may bias our results. “Overdispersion happens for real, scientifically important reasons, and these reasons may throw doubt upon our ability to interpret the experiment in an unbiased way” [7]. As an example, let’s reconsider the bromeliad data. Fortunately, in this case (see the summary for model m2 in this section), the residual deviance, 175.06, is not too different from the residual degrees of freedom, 188. However, one thing that commonly leads to overdispersion in biological data is an abundance of zero counts. To illustrate, let’s randomly replace 30 observations of S.d with zeros. This can easily be done using the following code.

    > d$S.d[sample(1:190, 30)] = 0
    > m3 = glm(S.d ~ log(debris.wt), family = poisson,
    +     data = d)
    > summary(m3)

    Call:
    glm(formula = S.d ~ log(debris.wt), family = poisson, data = d)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -5.1754  -0.7284   0.2409   1.0149   2.9512

    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)     1.25268    0.07538   16.62   <2e-16 ***
    log(debris.wt)  0.28483    0.02377   11.98   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for poisson family taken to be 1)

        Null deviance: 789.1  on 189  degrees of freedom
    Residual deviance: 634.0  on 188  degrees of freedom
    AIC: 1273.7

    Number of Fisher Scoring iterations: 5

Here, we see that the residual deviance (634.0) is more than three times the residual df (188). What to do?
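Before deciding, it can help to put a number on the comparison. The sketch below computes the ratio of residual deviance to residual df before and after replacing 30 observations with zeros. It uses simulated data rather than the bromeliad data: the data frame `sim`, the response `count`, and the coefficients 1.5 and 0.3 are hypothetical values chosen only for illustration.

```r
# Simulated stand-in for the bromeliad data (all values are assumptions)
set.seed(42)
n   <- 190
sim <- data.frame(debris.wt = rlnorm(n, meanlog = 1, sdlog = 1))
sim$count <- rpois(n, lambda = exp(1.5 + 0.3 * log(sim$debris.wt)))

# Poisson fit to the simulated counts as generated
fit_ok   <- glm(count ~ log(debris.wt), family = poisson, data = sim)
ratio_ok <- deviance(fit_ok) / df.residual(fit_ok)

# Replace 30 randomly chosen observations with zeros and refit
sim_zero <- sim
sim_zero$count[sample(n, 30)] <- 0
fit_zero   <- glm(count ~ log(debris.wt), family = poisson, data = sim_zero)
ratio_zero <- deviance(fit_zero) / df.residual(fit_zero)

# A ratio near 1 is what the Poisson assumption predicts;
# a ratio well above 1 suggests overdispersion
c(original = ratio_ok, with_zeros = ratio_zero)
```

For model m3 above, the same ratio is 634.0/188, or roughly 3.4.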

Well, at this point, it is worth asking, “are there any potentially important drivers that have been left out of this analysis that could be confounding these results?” If, after careful consideration, you wish to proceed, there are options for attempting to deal with the overdispersion. R implements two families within glm that are useful in such situations: the “quasibinomial” and the “quasipoisson”. In these families, the dispersion parameter, $\tau$, instead of being fixed at 1, as is the case for the Binomial and Poisson, is estimated as if it were an unknown parameter. As was mentioned in section 1.2.2, the variance of the error distribution for the quasibinomial is $N\rho(1-\rho)\tau$. For the quasipoisson, it is $\lambda\tau$. The dispersion parameter estimate, $\hat\tau$, is then used in the calculation of standard errors.
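To make the role of $\hat\tau$ concrete, here is a minimal sketch on simulated data (the data frame `sim`, the response `count`, and the coefficients are hypothetical, not the bromeliad data). It relies on the fact that `summary()` for a quasi family estimates the dispersion from the Pearson residuals divided by the residual df, and it shows that the quasipoisson standard errors are the Poisson standard errors multiplied by $\sqrt{\hat\tau}$.

```r
# Simulated counts with 30 values forced to zero (all values are assumptions)
set.seed(42)
n   <- 190
sim <- data.frame(debris.wt = rlnorm(n, meanlog = 1, sdlog = 1))
sim$count <- rpois(n, lambda = exp(1.5 + 0.3 * log(sim$debris.wt)))
sim$count[sample(n, 30)] <- 0

fit_pois  <- glm(count ~ log(debris.wt), family = poisson, data = sim)
fit_quasi <- glm(count ~ log(debris.wt), family = quasipoisson, data = sim)

# Dispersion estimate by hand: sum of squared Pearson residuals / residual df
tau_hat <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
c(by_hand = tau_hat, from_summary = summary(fit_quasi)$dispersion)

# Standard errors: the quasipoisson values should match the rescaled Poisson ones
se_pois  <- coef(summary(fit_pois))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]
cbind(poisson = se_pois, rescaled = se_pois * sqrt(tau_hat), quasipoisson = se_quasi)
```

The hand-computed and reported values should agree up to the convergence tolerance of the fitting algorithm. Returning to the bromeliad data, the quasipoisson fit is requested in the same way: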

    > m4 = glm(S.d ~ log(debris.wt), family = quasipoisson,
    +     data = d)
    > summary(m4)

    Call:
    glm(formula = S.d ~ log(debris.wt), family = quasipoisson, data = d)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -5.1754  -0.7284   0.2409   1.0149   2.9512

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)     1.25268    0.11460   10.93  < 2e-16 ***
    log(debris.wt)  0.28483    0.03615    7.88 2.56e-13 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for quasipoisson family taken to be 2.311609)

        Null deviance: 789.1  on 189  degrees of freedom
    Residual deviance: 634.0  on 188  degrees of freedom
    AIC: NA

    Number of Fisher Scoring iterations: 5

Notice that much of the output is exactly the same as we saw for the loglinear model. The only exceptions are:

  1. a value for the dispersion parameter is reported,

  2. the standard error estimates have changed and the coefficients are now tested using t-tests, and

  3. the AIC is not reported.
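The change in the standard errors can be checked by hand from the two summaries. As a worked calculation using only the printed (rounded) values, each Poisson standard error is multiplied by $\sqrt{\hat\tau}$:

$$\sqrt{\hat\tau} = \sqrt{2.311609} \approx 1.520, \qquad 0.07538 \times 1.520 \approx 0.1146, \qquad 0.02377 \times 1.520 \approx 0.0361.$$

These agree, up to rounding, with the quasipoisson standard errors of 0.11460 and 0.03615.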

Parameter estimates and values for deviances are unchanged. However, instead of relying on maximum likelihood estimation, we are really maximizing the “quasi-likelihood.” As a result, the ANODEV table can no longer rely on the likelihood ratio $\chi ^{2}$ test (and the AIC is undefined). Instead, an approximate F-test that uses the dispersion parameter can be requested.

    > anova(m4, test = "F")

    Analysis of Deviance Table

    Model: quasipoisson, link: log

    Response: S.d

    Terms added sequentially (first to last)


                   Df Deviance Resid. Df Resid. Dev      F    Pr(>F)
    NULL                             189      789.1
    log(debris.wt)  1    155.1       188      634.0 67.096 3.879e-14 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results of the F-test are consistent with the results of the t-test found in the model summary. There is a significant relationship between S.d and log(debris.wt).
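The F statistic in this table can be reconstructed from quantities already reported: it is the change in deviance per degree of freedom, divided by the estimated dispersion. The sketch below types in the rounded values from the output above (the variable names are ours, chosen for illustration), so the result matches the table only up to rounding.

```r
# Values copied from the printed output (rounded)
delta_dev <- 155.1      # drop in deviance when log(debris.wt) is added
delta_df  <- 1          # degrees of freedom used by the term
tau_hat   <- 2.311609   # estimated dispersion from summary(m4)
resid_df  <- 188        # residual degrees of freedom

# Approximate F statistic and its upper-tail p-value
F_stat <- (delta_dev / delta_df) / tau_hat
p_val  <- pf(F_stat, df1 = delta_df, df2 = resid_df, lower.tail = FALSE)
c(F = F_stat, p = p_val)
```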


  1. This figure is based, in part, on p. 511 in [