11.3 The probit model

Before going on to look at response variables with other error distributions, it is worth briefly mentioning another link function that is sometimes used for proportion data: the “probit”. The probit (short for “probability unit”) is defined as:

\begin{equation} \mathrm{probit}(p) = \Phi^{-1}(p) \end{equation}

where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function (CDF). Probits express a proportion $p$ as a number of standard deviations from the mean of a standard normal distribution, namely the point below which a fraction $p$ of the distribution lies. More specifically, they are the quantiles returned by qnorm().
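As a quick check of this definition, the short sketch below uses base R's qnorm() and pnorm() to move between proportions and probits. The proportions in p are arbitrary illustrative values, not data from this chapter.

```r
# Illustrative proportions (arbitrary values chosen for this sketch)
p <- c(0.025, 0.16, 0.5, 0.84, 0.975)

# probit(p): the standard normal quantile of each proportion
probits <- qnorm(p)

# pnorm() is the inverse transformation: it maps probits back to proportions
p_back <- pnorm(probits)

print(round(probits, 2))
print(all.equal(p, p_back))
```

A proportion of 0.5 maps to a probit of 0, and proportions of 0.025 and 0.975 map to roughly -1.96 and 1.96, the familiar limits that enclose 95% of a standard normal distribution.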

The use of probits was first proposed in an influential paper, and they have a long history in toxicological research. In most cases, results obtained with the probit link function will be nearly identical to those obtained with the logit. However, the probit does not perform as well as the logit when the data are sparse at intermediate levels of the predictor variable. For example, consider the following data on fathead minnow survival when exposed to six different concentrations of sodium pentachlorophenol (NaPCP). The data were originally collected by Weber and colleagues and appear as an example in [18]. At each concentration, four tanks of ten minnows were used. The data are stored in a file called “newman_example5_1.csv”.

     >  d = read.csv("newman_example5_1.csv")
     >  str(d)

    'data.frame':   24 obs. of 3 variables:
     $ rep : int 1 2 3 4 1 2 3 4 1 2 ...
     $ conc : int 0 0 0 0 32 32 32 32 64 64 ...
     $ prop_surv: num 1 1 0.9 0.9 0.8 0.8 1 0.8 0.9 1 ...

The variable rep denotes the replicate tank, conc is the concentration of NaPCP, and prop_surv is the proportion of individuals that survived in each tank. We will model the proportion of individuals that died, which we will call y. For comparison, we first fit an ordinary linear model (m) and then two binomial GLMs (m2 and m3). To fit the probit, we specify link = "probit" inside the call to binomial() that is passed to the family argument. Notice that if the link function is not specified (as in the statement for m2 below), the default link is the logit. Because y is a proportion rather than a count, the weights argument supplies the number of minnows in each tank (ten).

     >  d$y = 1 - d$prop_surv
     >  m = lm(y ~ log(conc + 1), data = d)
     >  m2 = glm(y ~ log(conc + 1), family = binomial,
    + data = d, weights = rep(10, 24))
     >  m3 = glm(y ~ log(conc + 1), family = binomial(link = "probit"),
    + data = d, weights = rep(10, 24))
     >  plot(d$y ~ log(d$conc + 1), xlab = "Ln(Conc+1)",
    + ylab = "Proportion Dead", pch = 16)
     >  abline(m, lwd = 2)
     >  newdata = data.frame(conc = seq(0, 512, by = 0.1))
     >  lines(log(newdata$conc + 1), predict(m2, newdata,
    + type = "response"), col = "blue", lwd = 2)
     >  lines(log(newdata$conc + 1), predict(m3, newdata,
    + type = "response"), col = "red", lwd = 2)
     >  legend(0, 0.8, c("Linear", "Logit", "Probit"),
    + col = c("black", "blue", "red"),
    + lty = c(1, 1, 1), lwd = c(2, 2, 2))

Figure 1.5: Scatterplot showing linear, logit, and probit models fitted to data on fathead minnow mortality following exposure to NaPCP.
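The weights argument in the glm() calls above tells R how many individuals each proportion is based on. A minimal sketch of an equivalent specification, using a two-column response of counts, is given below. The data frame toy and the counts in it are hypothetical values invented for this illustration; they are not the NaPCP data.

```r
# Hypothetical counts for illustration only (NOT the minnow data):
# six concentrations, four tanks each, ten animals per tank
toy <- data.frame(
  conc = rep(c(0, 32, 64, 128, 256, 512), each = 4),
  dead = c(0, 0, 1, 1,  1, 2, 0, 2,  1, 0, 2, 1,
           2, 3, 2, 4,  6, 5, 7, 6,  9, 10, 9, 10),
  n    = 10
)
toy$alive <- toy$n - toy$dead
toy$y     <- toy$dead / toy$n   # proportion dead in each tank

# Form 1: proportion response, with weights giving the number of trials
fit_w <- glm(y ~ log(conc + 1), family = binomial(link = "probit"),
             data = toy, weights = n)

# Form 2: two-column response of (number dead, number alive)
fit_c <- glm(cbind(dead, alive) ~ log(conc + 1),
             family = binomial(link = "probit"), data = toy)

# Both forms describe the same binomial likelihood
print(all.equal(coef(fit_w), coef(fit_c)))
```

Either form can be used. The two-column form is convenient when the raw counts are available, whereas the weighted form suits data that are already stored as proportions, as in the minnow file.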

In these data (Fig. 1.5), there is a relatively large gap between the controls and the lowest dose. As a result, compared with the logit model, the probit model tends to overestimate mortality at intermediate doses and underestimate it at higher doses, although the differences are slight. Because of such differences, the logit model can be considered a more robust alternative to the probit.
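The general similarity of the two links can also be checked directly, without any data, by comparing their inverse link functions. The sketch below does this with pnorm() and plogis(). The constant 1.702 is a commonly quoted rescaling factor that puts the logistic curve on roughly the same scale as the standard normal; it is an assumption of this illustration and is not estimated from the minnow data.

```r
# Grid of values on the probit (standard normal) scale
z <- seq(-4, 4, by = 0.01)

# Inverse probit link: standard normal CDF
p_probit <- pnorm(z)

# Inverse logit link, rescaled by 1.702 (assumed constant, see text)
p_logit <- plogis(1.702 * z)

# Largest disagreement between the two curves, in probability units
max_diff <- max(abs(p_probit - p_logit))
print(max_diff)
```

On this grid the two curves never differ by as much as 0.01 in probability, which is why fitted values from the two links are usually so close. The logistic curve approaches 0 and 1 more slowly than the normal curve, so the two models tend to disagree most where the data constrain the shape of the curve least.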