Chapter 2 Random Variables

In this chapter, the concepts of random variables, degrees of freedom, and probability distributions are explained.

2.1 What are Random Variables?

In empirical science, we draw conclusions about some larger population, using only a limited sample. This process is called inference, and it draws heavily from the definition of a random variable. Though this chapter may seem somewhat abstract, it is important that you have at least some idea what a random variable is.

In experiments, random samples are taken to extract particular properties from a population of interest; e.g. we randomly pick 20 students from the population of GRS students in 2021. We will discuss this in more detail on day 2 during the lecture on Experimental Design, but for now assume, as an example, that we are able to pick 20 individuals at random from the population of GRS students from Leiden.

In the natural sciences, we might not be interested in these individuals per se, but rather in some properties of the population that can be inferred from this group of 20 individuals; e.g. their height, age, or eye color. In all of these cases the property can be translated into a numeric value; e.g. 180 cm of height, 10 years since birth or enrollment at the university (academic age), and an integer representing the number of brown-eyed students in our sample (a number from 0 to 20). The numbers generated by these experiments are called random numbers, and the variable that they represent, like height or the number of brown-eyed students, is called a random variable.

Some properties of random variables that are relevant to this course:

We cannot predict the value of a random variable with absolute precision.
- Every new sample generates new random variable values.
- E.g. new samples of 20 students can give different numbers of brown-eyed students.
Functions based on random variables are also random variables.
- The function that calculates the mean takes random variables as input and therefore is also a random variable.
- New samples can give different means.
Because a random variable can take different values, there is a distribution associated with it.
- The distribution of the ages of biology students in Leiden.
- The distributions of the sample means if we repeat our experiment (important notion in statistics).
- The distribution of the number of brown-eyed students in a sample of 20 GRS students.
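
The properties listed above can be illustrated with a small simulation. This is a minimal sketch, not part of the course material: the population size (1000 students), the share of brown-eyed students (55%), and the helper name `brown_eyed_in_sample` are all assumptions made up for this example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1000 students, 55% with brown eyes (assumed numbers)
population = [1] * 550 + [0] * 450  # 1 = brown eyes, 0 = other eye color


def brown_eyed_in_sample(pop, n=20):
    """Draw n students at random (without replacement) and count the brown-eyed ones."""
    return sum(random.sample(pop, n))


# Repeat the "experiment" five times: every new sample gives a new value
counts = [brown_eyed_in_sample(population) for _ in range(5)]
print("Brown-eyed students per sample:", counts)

# A function of a random variable is itself a random variable:
# the sample proportion also changes from sample to sample.
proportions = [c / 20 for c in counts]
print("Sample proportions:", proportions)

# Repeating the experiment many times reveals the distribution of the count
many_counts = [brown_eyed_in_sample(population) for _ in range(2000)]
print("Mean count over 2000 samples:", statistics.mean(many_counts))
print("SD of the count over 2000 samples:", statistics.stdev(many_counts))
```

Individual counts cannot be predicted, but their distribution is stable: over many samples the mean count settles near 20 times the assumed population share.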

In the screencast below, I discuss random variables in more detail:

To summarize:

Random variables (RV) are the starting point of probability theory and statistics.
- Statistics is built upon a solid foundation of mathematics.
The number of RVs can sometimes be expressed in degrees of freedom if statistics like the mean are involved in a calculation.
Statistics like the mean, variance, and standard deviation (SD) are random variables, and have their own distributions and statistics; e.g. you can estimate the SD of the distribution of means.
Statistics are estimates, so how good are our estimates? This can be expressed through the standard error (SE).
Can we extract information from random variables?
- Yes, and you will learn this in this course!
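
The statement that statistics such as the mean have their own distribution, and that the standard error describes its spread, can be checked by simulation. The sketch below assumes a hypothetical population of heights that is normally distributed with mean 175 cm and SD 10 cm; these numbers and the helper name `sample_mean` are illustrative assumptions, not values from the course.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Assumed population parameters (illustrative only) and sample size
MU, SIGMA, N = 175.0, 10.0, 20


def sample_mean(n=N):
    """Mean height of one random sample of n individuals."""
    return statistics.mean([random.gauss(MU, SIGMA) for _ in range(n)])


# The mean is a random variable: repeat the experiment to see its distribution
means = [sample_mean() for _ in range(5000)]
sd_of_means = statistics.stdev(means)

# Theory: the SD of the distribution of means equals sigma / sqrt(n)
theoretical_se = SIGMA / math.sqrt(N)

# In practice we have one sample only, and estimate the SE from it.
# statistics.stdev divides by n - 1: one degree of freedom is used by the mean.
one_sample = [random.gauss(MU, SIGMA) for _ in range(N)]
estimated_se = statistics.stdev(one_sample) / math.sqrt(N)

print("SD of 5000 simulated means:", round(sd_of_means, 3))
print("sigma / sqrt(n):           ", round(theoretical_se, 3))
print("SE estimated from 1 sample:", round(estimated_se, 3))
```

The first two numbers should be close to each other. The third varies from sample to sample, because the estimated SE is itself a random variable.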

The probability-theoretic basis is solid: it is exact, not vague.
- Vagueness in statistics is due to the quality of data (experimental design).

Next we address the problem of how we can extract information from random variables. It turns out that statistics has found a way to partition a random variable into a systematic and informative part, and a random part called the error. This will introduce the concepts of the degrees of freedom of the statistical model and the random (stochastic) error.

2.1.1 Exercise

Describe random variables in your own words and give some examples of random variables.

2.2 Degrees of Freedom

Random variables, degrees of freedom and statistical models.

Extracting information from random variables is the main purpose of statistics. By constructing a statistical model that has a close relationship with the hypothesis to be tested (topic of tomorrow), statistics can extract information from data.

We start from the idea of the degrees of freedom of a model, and show its relation with the number of observations, thereby introducing the idea of a statistical model and the motivation for applying statistics.

To summarize:

If \(n=\text{df}_{\text{model}}\), we have a perfect fit;
- \(\text{df}_\text{residuals}=0\) is also known as a saturated model.
The right-hand side (RHS) of a statistical model, when \(n > \text{df}_\text{model}\), has two parts:
- A deterministic part: e.g. \(\text{BD} = \mathbf{\text{intercept} + \text{slope} \cdot \text{conc}} + \text{residuals}\), which is a formal representation of our hypothesis and is informative.
- A stochastic or random part: e.g. \(\text{BD} = \text{intercept} + \text{slope} \cdot \text{conc} + \mathbf{\text{residuals}}\), representing the properties of the random variable, which is non-informative.
The residuals represent our ignorance and uncertainty.
- Ignorance of this type leads to uncertainty: We can have a model for cell division predicting that the cell divides on average after 5 minutes, but will it divide after 5 minutes or will it be earlier or later?
- The level of our ignorance will affect the level of our confidence in our model.
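
The relation between the number of observations and the degrees of freedom of the model can be made concrete with the straight-line model from the summary above. The sketch below is an illustration under assumptions: the `conc` and `bd` values are made up, and `fit_line` and `residuals` are hypothetical helper names implementing ordinary least squares by hand.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed value minus the value predicted by the deterministic part."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


DF_MODEL = 2  # the model estimates two parameters: intercept and slope

# Made-up observations (illustrative only)
conc = [1.0, 2.0, 3.0, 4.0, 5.0]
bd = [2.1, 3.9, 6.2, 7.8, 10.1]

# n = 2 = df_model: saturated model, the line passes through both points
b0, b1 = fit_line(conc[:2], bd[:2])
res_saturated = residuals(conc[:2], bd[:2], b0, b1)
print("n = 2, df_residuals =", 2 - DF_MODEL, "residuals:", res_saturated)

# n = 5 > df_model: the residuals now carry the random part
b0, b1 = fit_line(conc, bd)
res_full = residuals(conc, bd, b0, b1)
print("n = 5, df_residuals =", len(conc) - DF_MODEL, "residuals:", res_full)
```

With two observations the residuals are zero up to floating-point rounding, so nothing is left to quantify uncertainty. With five observations, three degrees of freedom remain for the residuals.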

As the data that we sample play such a crucial role, we must take care when acquiring our data. This is the field of experimental or study design, the topic of our next lecture tomorrow.

2.2.1 Exercises

Describe, in your own words, the relation between a random variable, degrees of freedom and a statistical model;
Do you agree that the source of vagueness relates to the quality of the data? Write down an argument in favor of or against the above statement.

2.3 Probability Distributions

A probability distribution describes the chance of different outcomes of a random variable. Also see chapter 6 of Introduction to Biostatistics.

What is the chance of being taller than 2 meters? Or what is the highest grade you can expect from random guessing on a multiple choice exam? These are questions that can be answered using a probability distribution. In the video below, three commonly used distributions are explained at a conceptual level. You don’t have to memorize their probability density functions or cumulative distribution functions for this course.

To summarize:

A normal distribution is used to model continuous outcomes with a central tendency;
A Poisson distribution is used to model independent counts;
A binomial distribution is used to model binary data and proportions;
All of these distributions are realistic only after accounting for any large, structural differences; typical raw data rarely follow them directly;
How to account for systematic differences is what you will learn in the lectures on statistical models, particularly regression analysis.
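
The two questions at the start of this section can be turned into small calculations. The sketch below rests on assumptions that are not from the course: adult male height is taken to be normal with mean 181 cm and SD 7 cm, and the exam is taken to have 40 questions with 4 answer options each. `binom_pmf` and `prob_at_least` are hypothetical helper names.

```python
import math
from statistics import NormalDist

# Assumed height distribution (illustrative numbers only)
height = NormalDist(mu=181, sigma=7)

# Chance of being taller than 2 meters: the upper tail of the normal distribution
p_taller_than_2m = 1 - height.cdf(200)
print("P(height > 200 cm):", p_taller_than_2m)

# Assumed exam: 40 questions, 4 options each, pure guessing
N_QUESTIONS, P_GUESS = 40, 0.25


def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n=N_QUESTIONS, p=P_GUESS):
    """Probability of k or more correct answers when guessing."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


print("Expected number correct:", N_QUESTIONS * P_GUESS)
for k in (10, 15, 20):
    print(f"P(at least {k} correct):", prob_at_least(k))
```

Changing the assumed mean, SD, number of questions, or number of options changes the answers, but not the reasoning: each question is a tail probability of a suitable distribution.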

2.3.1 Exercises

Why is the normal distribution a good approximation for adult male height in a given year, in a given country?
Would it still be a good approximation for adults in a given country? Explain.
What does it mean that for the Poisson distribution, the mean is equal to the variance?
What is skew? Can the normal distribution be skewed? How about the Poisson distribution, or the binomial distribution? Explain.

New (alpha version). We are trying a new concept where you can pick your own set of exercises depending on your interests as a biology student. We have four categories for you to pick from:

A—Molecular, Cellular & Medical Biology (exercises);
B—Ecology, Evolution & Behavioral Biology (exercises);
C—Computational Biology and Bioinformatics (exercises);
D—Education & Science Communication (exercises).

Pick only one, download the exercises (PDF) and do the exercises.