Chapter 9 ANCOVA & Model Selection
ANCOVA is multiple linear regression with both categorical and continuous explanatory variables.
Model selection is a set of guidelines for choosing the right model. The best model depends on what the model is intended to be used for.
9.1 Introduction
We now turn our attention to the case where we have at least one numeric variable and at least one factor as explanatory variables. We call this type of model ANCOVA, or ANalysis of COVAriance. A simple example is the model Y ~ F + X + F:X, with F a factor and X a numeric variable.
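As a minimal sketch of what this formula looks like in practice, the block below simulates a small data set and fits the model with lm(). Everything apart from the formula itself is an assumption made for illustration: the three factor levels, the group size, the "true" intercepts and slopes, and the noise level are invented and do not come from the screencasts.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n_per_group <- 20  # assumed group size
dat <- data.frame(
  F = factor(rep(c("a", "b", "c"), each = n_per_group)),
  X = runif(3 * n_per_group, min = 0, max = 10)
)

# Assumed "true" intercept and slope for each level of F (illustrative values)
intercepts <- c(a = 2, b = 4, c = 1)
slopes     <- c(a = 0.5, b = 0.5, c = 1.5)

lev   <- as.character(dat$F)
dat$Y <- unname(intercepts[lev] + slopes[lev] * dat$X) +
  rnorm(nrow(dat), sd = 1)

# F + X + F:X can also be written more compactly as F * X
fit <- lm(Y ~ F + X + F:X, data = dat)

coef(fit)
```

With R's default treatment contrasts, (Intercept) and X are the intercept and slope of the reference level a, Fb and Fc are the differences in intercept relative to a, and Fb:X and Fc:X are the differences in slope relative to a.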
Most of the groundwork has been covered in the screencasts on one-way ANOVA, linear regression and factorial ANOVA. Therefore, this set of screencasts on ANCOVA is more applied: it follows the same steps as the factorial ANOVA and emphasizes the differences where needed. Try to identify the similarities and differences between these multiple regression models.
The first screencast discusses how the interpretation of an ANCOVA model differs from that of a factorial ANOVA model. The second screencast runs through the analysis of an ANCOVA step by step.
9.2 The ANCOVA explained by simulations
Simulations are used to explain the challenges faced when analyzing ANCOVA: when are interactions significant in the ANCOVA context, and what does this mean? First we simulate the same model without interaction multiple times, using different set.seed() values, to investigate the uncertainty in the estimates of the slopes. Then we simulate a model with interaction.
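The idea can be sketched roughly as follows. This is not the code from the screencast: the helper simulate_fit(), its arguments, and all parameter values (two groups, 15 observations each, a slope difference of 0 or 0.5, residual standard deviation 1) are hypothetical choices made for illustration.

```r
# Simulate one ANCOVA data set for a given seed and return the estimate and
# p-value of the interaction term. All parameter values are assumptions.
simulate_fit <- function(seed, slope_diff = 0, n_per_group = 15) {
  set.seed(seed)
  F <- factor(rep(c("a", "b"), each = n_per_group))
  X <- runif(2 * n_per_group, min = 0, max = 10)
  # Group b differs from group a by 1 in intercept and by slope_diff in slope
  Y <- 2 + 1 * (F == "b") + (0.5 + slope_diff * (F == "b")) * X +
    rnorm(2 * n_per_group, sd = 1)
  fit <- lm(Y ~ F + X + F:X)
  summary(fit)$coefficients["Fb:X", c("Estimate", "Pr(>|t|)")]
}

seeds <- 1:200
no_interaction   <- t(sapply(seeds, simulate_fit, slope_diff = 0))
with_interaction <- t(sapply(seeds, simulate_fit, slope_diff = 0.5))

# Spread of the estimated slope difference across seeds
summary(no_interaction[, "Estimate"])
summary(with_interaction[, "Estimate"])

# Proportion of simulated data sets in which the interaction has p < 0.05
mean(no_interaction[, "Pr(>|t|)"] < 0.05)
mean(with_interaction[, "Pr(>|t|)"] < 0.05)
```

When the true slopes are equal, the estimated slope difference still varies from seed to seed, and roughly 5% of the simulated data sets are expected to show a "significant" interaction purely by chance. This is why a single significant interaction term should be interpreted with care.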
9.3 The analysis of ANCOVA data step by step
An ANCOVA analysis is performed step by step, following the tutorial for ANCOVA. In this screencast the differences with a factorial ANOVA are emphasized.
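The tutorial itself is not reproduced here. As a rough sketch of two steps such an analysis commonly includes, the block below tests the interaction by comparing nested models and then draws the standard residual diagnostics. The data are simulated, and the group labels, slopes and noise level are assumptions chosen only so that the example runs.

```r
set.seed(42)  # arbitrary seed

# Simulated example data: two groups with different slopes (assumed values)
dat <- data.frame(
  F = factor(rep(c("a", "b"), each = 20)),
  X = rep(seq(0, 9.5, by = 0.5), times = 2)
)
# Group a: Y = 2 + 0.3 * X; group b: Y = 3 + 1.0 * X; plus normal noise
dat$Y <- 2 + ifelse(dat$F == "b", 1 + 1.0 * dat$X, 0.3 * dat$X) +
  rnorm(nrow(dat), sd = 1)

full    <- lm(Y ~ F * X, data = dat)  # separate intercepts and slopes
reduced <- lm(Y ~ F + X, data = dat)  # parallel lines: one common slope

# F-test for the interaction: does allowing different slopes improve the fit?
cmp <- anova(reduced, full)
cmp

# Residual diagnostics for the full model
op <- par(mfrow = c(2, 2))
plot(full)
par(op)
```

Unlike in a factorial ANOVA, the interaction here is between a factor and a numeric variable, so a significant interaction means that the slope of Y against X differs between the levels of F.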
9.3.1 Exercises
Below is part of the output of an ANCOVA analysis.
Residuals:
    Min      1Q  Median      3Q     Max
-0.8528 -0.3010  0.0563  0.2708  0.8555
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               5.09856    0.36516  13.963 4.25e-11 ***
TreatB                         ?1    0.51641  -1.405  0.17712
TreatC                   -1.30794         ?2  -2.533  0.02084 *
Concentration             0.06073    0.08729   0.696  0.49547
TreatmentB:Concentration  0.43544    0.12345      ?3  0.00241 **
TreatmentC:Concentration  0.24936    0.12345   2.020  0.05852 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5657 on 18 degrees of freedom
Multiple R-squared: 0.7831, Adjusted R-squared: 0.7229
F-statistic: 13 on 5 and 18 DF, p-value: 1.892e-05
- Calculate the missing numbers ?1, ?2 and ?3.
- How much of the variance is explained by this model?
- What was the total sample size of this experiment?
- Based on this output, do you think there is an interaction? Explain.
- What is the estimate for the intercept of TreatA? What about TreatB and TreatC?
- Draw what this model would look like. You may use R, or pen and paper.
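One relationship that helps with the first question: in a coefficient table like this one, each t value is the estimate divided by its standard error, and the p-value follows from a t distribution with the residual degrees of freedom. The short check below uses only the rows of the output that are complete. The variable names are chosen here for illustration.

```r
# Complete rows copied from the output above
estimate  <- c("(Intercept)"              = 5.09856,
               "Concentration"            = 0.06073,
               "TreatmentC:Concentration" = 0.24936)
std_error <- c(0.36516, 0.08729, 0.12345)
t_printed <- c(13.963, 0.696, 2.020)

# t value = Estimate / Std. Error
t_value <- estimate / std_error
round(t_value, 3)

# Two-sided p-value from a t distribution with the residual degrees of freedom
p_value <- 2 * pt(abs(t_value), df = 18, lower.tail = FALSE)
signif(p_value, 3)
```

The computed values can be compared with the printed t value and Pr(>|t|) columns. The same relationship, rearranged where necessary, gives the missing entries.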
9.4 Model Selection
9.4.1 Exercises
A set of exercises can be downloaded here and its required data set here.
(If you can’t knit, click here for a PDF version of the exercises.)
9.4.2 Exercises (hard)
In order to study the effect of caviar on health, a survey is distributed in a city asking respondents whether they have ever eaten caviar and, if so, how frequently they eat it. After the responses are received, 100 random individuals from each of the following groups are invited for a health check-up:
- Group A: Has never consumed caviar;
- Group B: Has tried caviar once, or a few times in their life;
- Group C: Eats caviar about once per year;
- Group D: Eats caviar multiple times per year;
- Group E: Eats caviar about once per month.
What problems do you think there might be with this study design? (HINT: There are at least two major flaws.)
Can you come up with a better design for the research question?
Can you come up with a minimal sample size needed to conduct this research? You can estimate the minimal required sample size for both the original design and your own version.
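For the sample size question, one possible starting point is a power calculation with power.anova.test() from base R. The sketch below assumes that health is summarized as a single continuous score compared across the five groups with a one-way ANOVA. That is an assumption, since the exercise does not specify the outcome. The group means and the within-group standard deviation are hypothetical numbers that you would need to replace with values justified by earlier research or a pilot study.

```r
# Assumed inputs, for illustration only
group_means <- c(A = 70, B = 70, C = 71, D = 72, E = 73)  # hypothetical mean health score per group
within_sd   <- 10                                         # hypothetical SD within each group

res <- power.anova.test(
  groups      = length(group_means),
  between.var = var(group_means),  # variance between the group means
  within.var  = within_sd^2,       # variance within groups
  sig.level   = 0.05,
  power       = 0.80
)
res

# Required number of individuals per group, rounded up
ceiling(res$n)
```

The result is sensitive to the assumed effect size: smaller differences between groups or larger within-group variation increase the required sample size. A different design, such as one that adjusts for confounders, would need a power calculation that matches its own analysis.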