Chapter 12 GLM
A model for a response variable and any number of explanatory variables (continuous and/or categorical). The data-generating process can follow any distribution in the exponential dispersion family.
12.1 Preparation
To properly understand the generalized linear model, you should by now be familiar with:
If you want to prepare for this chapter before the lectures on GLM, I recommend revisiting those subjects.
12.2 Introduction
Data do not always abide by the assumptions of our statistical models. This might be due to extreme values in our data, which might be outliers, or due to bad experimentation. But there are also many instances where the deviations from the assumptions are due to more fundamental properties of the data. We have observed this in the fruitfly and rats data sets. Longevity and survival can be interpreted as the time we need to wait until the fruit flies or rats die. Most often, waiting times do not follow a normal distribution. Counting processes also result in non-normal distributions (Poisson).
The same is true if our response variable is not numeric but, for instance, a factor with two levels, e.g. the presence or absence of a particular object, success or failure, dead or alive, sprouting or not, gene on or off, to name a few. These two-valued response data are called binomial data and the related distribution is the binomial distribution.
Waiting times, counts and binomial data occur commonly in life science research. To accommodate the analysis of these data, we need to generalize the linear model in a way that allows for the statistical analysis of Poisson and binomial data while keeping the ease of interpretation of the linear model. The method that was developed is called the generalized linear model, or GLM.
Below we will discuss two instances of GLMs, one for Poisson and one for binomial data. Apart from a small addition in the code, the key difference is that the GLM transforms the model rather than the response variable. This allows us to give a standard interpretation to the intercepts and, in particular, the slopes.
GLM is also discussed in chapter 3 of the book Elements of Biostatistics.
12.3 Poisson GLM
In this first screencast we give the motivation for choosing a GLM approach to Poisson data. I also discuss the link between the different distributions. It turns out that the three distributions, normal, Poisson, and binomial, all belong to the same family of distributions, the exponential dispersion family. It’s important to note that both the Poisson and binomial distributions have only one free parameter, \(\lambda\) and \(p\), respectively, whereas the normal distribution has two: \(\mu\) and \(\sigma\).
Question:
- How are the distribution parameters of the normal distribution, \(\mu\) and \(\sigma\), related to the parameter of the Poisson distribution, \(\lambda\), and the parameters of the binomial distribution, \(p\) and \(n\)?
12.3.1 Model specification
Because GLMs can deal with the different distributions from the exponential family, we need to make explicit which family member we select, e.g. Poisson for counts and binomial for two-level factors. One example of a GLM encoded in R:
Model <- glm(Counts ~ Treatment + Concentration + Treatment:Concentration, family = poisson(link = "log"))
Note that you can identify:
- Linear equation: Counts ~ Treatment + Concentration + Treatment:Concentration
- Exponential family member: poisson
- Link function: link = "log"
Further note that we still have a linear model. Therefore we will run through all the steps that we did in the analysis of linear models.
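As a minimal sketch of this specification, the following R code simulates a small data set and fits the model. The data frame `sim`, its size, and the coefficient values used to generate the counts are assumptions made up for illustration. Only the structure of the `glm()` call mirrors the text, with the interaction written as `Treatment:Concentration` so that it refers to the simulated columns.

```r
# Simulated data; all names and numbers below are illustrative assumptions
set.seed(1)
sim <- data.frame(
  Treatment     = factor(rep(c("control", "treated"), each = 50)),
  Concentration = rep(seq(0, 5, length.out = 50), times = 2)
)

# Linear predictor on the log scale, then Poisson counts with mean exp(eta)
eta <- 0.5 + 0.3 * (sim$Treatment == "treated") + 0.4 * sim$Concentration
sim$Counts <- rpois(nrow(sim), lambda = exp(eta))

# Same three ingredients as in the text: linear equation, family member, link
Model <- glm(Counts ~ Treatment + Concentration + Treatment:Concentration,
             family = poisson(link = "log"), data = sim)

summary(Model)  # coefficients are reported on the log scale
```

The summary table has the same layout as for a linear model, which is why the familiar analysis steps carry over.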
Question:
- Name three link functions for the Poisson distribution.
12.3.2 Analysis of GLM data and interpretation
In the GLM we transform the model in order to get the residuals to behave like a normal distribution again. The assumption of homoscedasticity is not relevant here, because heteroscedasticity is a property of the Poisson distribution: its variance equals its mean, so the spread grows with the expected count. Other than that, the steps are mostly the same as with the linear models you have seen before.
To understand the estimate of the slope in a Poisson GLM we need to transform it back, e.g. from log (which is the natural logarithm in R) using the exponential function. This introduces nonlinear behavior: the fitted relationship becomes a curve, and the slope is now the slope of the tangent line to that curve, which changes along the curve.
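The back-transformation can be sketched in R as follows. The data are simulated and the names `x`, `y`, `fit`, `b`, `mu` and `ratio` are assumptions chosen for this illustration. The sketch shows that, with a log link, the exponentiated slope is the factor by which the expected count is multiplied when the explanatory variable increases by one unit.

```r
# Illustrative simulated data; names and numbers are assumptions
set.seed(2)
x <- seq(0, 3, length.out = 60)
y <- rpois(length(x), lambda = exp(0.2 + 0.5 * x))

fit <- glm(y ~ x, family = poisson(link = "log"))

b <- coef(fit)   # intercept and slope on the log scale
exp(b["x"])      # multiplicative change in the expected count per unit of x

# Expected counts at x = 1 and x = 2, back-transformed to the response scale
mu <- predict(fit, newdata = data.frame(x = c(1, 2)), type = "response")

# The ratio of the two expected counts equals the exponentiated slope
ratio <- unname(mu[2] / mu[1])
```

On the response scale the fitted curve is therefore exponential: equal steps in `x` multiply the expected count by the same factor rather than adding the same amount.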
Question:
- Would a slope smaller than 1 result in an increasing function (curve) or a decreasing function?
- What is overdispersion?
12.4 Binomial GLM
We use a binomial GLM if our response data consist of a two-level factor. It follows the same steps as the Poisson GLM; however, the intercepts and slopes are interpreted differently. In this first screencast we discuss the motivation behind using the binomial GLM.
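A minimal sketch of a binomial GLM in R is given below. It assumes the logit link, which is the default of R's `binomial()` family. The simulated data and the names `dose`, `alive`, `fit_bin` and `p_hat` are made up for illustration. With this link the coefficients are on the log-odds scale, so the exponentiated slope is an odds ratio and the inverse logit (`plogis()`) returns a probability.

```r
# Illustrative simulated data; names and numbers are assumptions
set.seed(3)
dose  <- seq(-2, 2, length.out = 80)
alive <- rbinom(length(dose), size = 1, prob = plogis(0.3 + 1.2 * dose))

# Two-valued response (0/1) with the binomial family and logit link
fit_bin <- glm(alive ~ dose, family = binomial(link = "logit"))

coef(fit_bin)               # intercept and slope on the log-odds scale
exp(coef(fit_bin)["dose"])  # odds ratio per unit increase in dose

# Predicted probability at dose = 0: the inverse logit of the intercept
p_hat <- unname(predict(fit_bin, newdata = data.frame(dose = 0),
                        type = "response"))
```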
Question:
- What is the main motivation to use a binomial GLM?
12.4.1 Analysing a binomial GLM
In analyzing binomial GLM data we use the same steps as before. However, it becomes difficult to perform good model diagnostics. This is either due to the small number of observations, because the successes in each group are reduced to a single number, or due to the two-valued nature of the response variable, as will be explained below.
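The two situations can be compared in a short R sketch. The data are simulated, and all names (`conc`, `success`, `agg`, `fit_binary`, `fit_grouped`) are assumptions for illustration. The same observations are fitted once as individual 0/1 values and once aggregated to a number of successes per group. The estimated coefficients agree, but the aggregated fit leaves only a handful of data points for diagnostics.

```r
# Illustrative simulated data; names and numbers are assumptions
set.seed(4)
conc    <- rep(1:5, each = 20)
success <- rbinom(length(conc), size = 1, prob = plogis(-2 + 0.7 * conc))

# Binary form: one 0/1 observation per individual
fit_binary <- glm(success ~ conc, family = binomial)

# Aggregated form: successes per concentration reduced to one number
agg <- data.frame(
  conc      = 1:5,
  n_success = as.vector(tapply(success, conc, sum)),
  n_total   = as.vector(tapply(success, conc, length))
)
fit_grouped <- glm(cbind(n_success, n_total - n_success) ~ conc,
                   family = binomial, data = agg)

coef(fit_binary)          # same estimates ...
coef(fit_grouped)         # ... as the aggregated fit

df.residual(fit_binary)   # 100 observations - 2 coefficients = 98
df.residual(fit_grouped)  # 5 groups - 2 coefficients = 3
```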
Question:
- If you prepare an experiment for analysis with a binomial GLM, in what range of the numerical response variable should you have enough data?