11.4 The loglinear model

We have just seen how the general linear model can be extended to deal with a binary response variable. Our approach was to model the proportion of observations that belonged to one class or the other using logistic regression. Another sort of response variable that biologists frequently deal with is counts (integers). To illustrate, consider the following data on bromeliads that were collected from a tropical forest. These data were first introduced back in Section ?? . Within each of these bromeliads, a community of aquatic invertebrate species exists in a natural microcosm. All of the individuals from these communities were enumerated and identified. The hypothesis is that, as the amount of detritus that is found within these communities increases, the number of detritivore species should also increase. Thus, we want to model detritivore richness as a function of detritus. The data are in a file called “bromeliad.csv”

     >  d = read.csv("bromeliad.csv")
     >  str(d)

    'data.frame':   190 obs. of 5 variables:
     $ plant : int 1 2 3 4 5 6 7 8 9 10 ...
     $ forest : int 1 1 1 1 1 1 1 1 1 1 ...
     $ year : int 1 1 1 1 1 1 1 1 1 1 ...
     $ debris.wt: num 17 17.4 52.1 44.5 43.4 ...
     $ S.d : int 9 7 11 9 11 7 5 6 7 19 ...

We see that the data frame has five variables. S.d is detritivore richness (i.e., the number of detritivore species), debris.wt is the amount of detritus. To start, let’s fit a linear model using a log transform for the biomass of detritus (Fig.1.6).

     >  m = lm(S.d ~ log(debris.wt), data = d)
     >  summary(m)

    Call:
    lm(formula = S.d ~ log(debris.wt), data = d)

    Residuals:
     Min 1Q Median 3Q Max
    -7.8392 -2.0483 -0.2904 1.7479 8.1964

    Coefficients:
     Estimate Std. Error t value Pr( > |t|)
    (Intercept) 3.2258 0.5136 6.281 2.28e-09 ***
    log(debris.wt) 2.3236 0.1793 12.958  <  2e-16 ***
    ---
    Signif. codes: 0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.015 on 188 degrees of freedom
    Multiple R-squared: 0.4718, Adjusted R-squared: 0.469
    F-statistic: 167.9 on 1 and 188 DF, p-value:  <  2.2e-16

     >  plot(d$S.d ~ log(d$debris.wt), pch = 16, xlab = "log(detritus biomass)",
    + ylab = "Detritivore Species Richness", xlim = c(-3,
    + 5), ylim = c(-5, 23))
     >  lines(c(-4, 6), c(0, 0), lty = 2)
     >  abline(m)

Figure 1.6: Scatterplot of the bromeliad data showing a linear fit. The dashed line represents zero species richness, and the axes have been extended to illustrate model prediction of negative richness values.

Clearly, there is an increase in species richness as the biomass of detritus increases. However, there are some potential problems with our model. First, we know that richness, an integer, is not normally distributed. In addition, the model predicts negative values of detritivore richness. This is biologically impossible. Furthermore, inspection of diagnostic plots reveals another problem with integer response variables.

     >  par(mfrow = c(2, 2))
     >  plot(m)

Figure 1.7: Diagnostic plots from the linear model fit of the bromeliad data.

In the top left panel of Fig.1.7, we see that the cloud of residuals is shaped a bit like a ’ < ’. As the fitted values increase (x-axis), the spread in the residuals increases (width of the cloud in the direction of the y-axis). This indicates that the variance of S.d is not constant (i.e., heteroscedastic). In fact, this pattern frequently is encountered when dealing with count data. One approach to dealing with this is to log-transform the dependent variable - doing so can help resolve heteroscedasticity. However, in this case, the distribution of errors will still be non-normal.

The loglinear model allows us to deal with such data. It is a GLM with a Poisson error distribution and a log link function. Thus, we assume

\begin{equation} y_{i} \sim Poisson(\lambda _{i}) \end{equation}

and the loglinear model is:

\begin{equation} ln(y_{i})=\beta _{0}+\beta _{1}X_{1} + \beta _{2}X_{2} + \cdots + \beta _{p}X_{p}+\epsilon _{i} \end{equation}

Let’s fit this model to the bromeliad data.

     >  m2 = glm(S.d ~ log(debris.wt), family = poisson,
    + d)
     >  summary(m2)

    Call:
    glm(formula = S.d ~ log(debris.wt), family = poisson, data = d)

    Deviance Residuals:
     Min 1Q Median 3Q Max
    -2.76047 -0.72079 -0.07123 0.56187 2.33552

    Coefficients:
     Estimate Std. Error z value Pr( > |z|)
    (Intercept) 1.45353 0.06859 21.19  < 2e-16 ***
    log(debris.wt) 0.27679 0.02169 12.76  < 2e-16 ***
    ---
    Signif. codes: 0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for poisson family taken to be 1)

     Null deviance: 350.63 on 189 degrees of freedom
    Residual deviance: 175.06 on 188 degrees of freedom
    AIC: 933.98

    Number of Fisher Scoring iterations: 4

Interpreting the model summary is much the same as it was in logistic regression. The number of iteratively reweighted least squares iterations is reported as the “Number of Fisher Scoring iterations”. Wald tests are reported for each of the individual parameter estimates, and the output indicates that the loglinear model for these data is

$ln(\hat y)=$1.45$+$0.28$ \times ln(debris.wt)$.

As in logistic regression, the likelihood ratio $\chi ^{2}$ test available in the ANODEV table (see below) is a little more robust than the Wald test for the slope, especially when sample size is small. In this case, with n = 190, this is not a major concern. The dispersion parameter is again assumed to be 1. This means that we are assuming that the theoretical expectation for the variance holds. For a Poisson distribution, this expectation is is that the variance is equal to $\lambda $ (which is also the mean). Finally, information on Null and Residual deviance also is reported. Using formula 1.2, 50.1% of the variability in the number of detritivore species is accounted for by the model.

     >  anova(m2, test = "Chisq")

    Analysis of Deviance Table

    Model: poisson, link: log

    Response: S.d

    Terms added sequentially (first to last)


     Df Deviance Resid. Df Resid. Dev P( > |Chi|)
    NULL 189 350.63
    log(debris.wt) 1 175.57 188 175.06 4.501e-40

    \begin{verbatim}
     >  plot(d$S.d ~ log(d$debris.wt), pch = 16, xlab = "log(detritus biomass)",
    + ylab = "Detritivore Species Richness")
     >  abline(m, lwd = 2)
     >  lines(c(-4, 6), c(0, 0), lty = 2)
     >  newdata = data.frame(debris.wt = seq(0.1, 150,
    + by = 0.1))
     >  lines(log(newdata$debris.wt), predict(m2, newdata,
    + type = "response"), col = "blue", lwd = 2)
     >  legend(-1.5, 20, c("Linear", "Loglinear"), col = c("black",
    + "blue"), lty = c(1, 1), lwd = c(2, 2))

Figure 1.8: Scatterplot of the bromeliad data with the linear and loglinear model fits.

In Fig.1.8, we see that the loglinear model does not give rise to negative richness predictions. In addition, because of the log link function, the model is not straight.