Linear models: a cheat-sheet
One model to rule them all
Linear models, aka linear regression models or regression models, are a group of statistical models based on the simple idea that we can predict an outcome variable \(Y\) as a function \(f(X)\) of one or more predictors \(X\).
The “simplest” linear model is the formula of a line:¹
\[ y = \alpha + \beta x \]
where \(\alpha\) is the intercept of the line and \(\beta\) the slope.
The principles behind this formula can be extended to represent virtually any other model, independent of the nature of the outcome variable(s) (\(y\)), the predictor(s), the types of relationship between outcome and predictor, and so on.
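For instance, here is a minimal R sketch (the data is simulated, so all numbers and names are made up) showing that fitting a line with lm() recovers \(\alpha\) and \(\beta\):

```r
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)  # alpha = 2, beta = 0.5, plus noise

m <- lm(y ~ x)
coef(m)  # estimates should land close to the true alpha = 2 and beta = 0.5
```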
This means that if you master the principles of linear models, you can fit virtually any kind of data with them. You can bid farewell to ANOVAs, \(t\)-tests, \(\chi^2\)-tests, and what not. In fact, these can all be thought of as specific cases of linear models. It just so happens that they got themselves a specific name, but the underlying mechanics are the same.
The same goes for “regressions”, “logistic regression”, “generalised linear models”, “mixed-effects regression” and so on. These are all linear models, so they all follow the same principles. And again, the fact that they got specific names is a historical “accident”.
Understanding that these named models are in fact all linear models gives you super powers you can use on data (Sauron would be so jealous):
One model to rule them all, one model to fit them,
One model to shrink them all, and in probability bind them;
In the Land of Inference where the distributions lie.
Ehm… perhaps this is not gonna win a poetry contest, but…
The message is that with a single tool, i.e. linear models, you can go a long way!
Each of the following sections asks you about the nature of your data and/or experimental design. By answering each, you will find out which “pieces” you need to add to your model structure.
(This is a work in progress, so still rough around the edges).
Step 0: Number of outcome variables
We will get back to this step at the end of this post, since it makes things a bit more complex.
Step 1: Choose a distribution for your outcome variable
The first step towards building a linear model is to choose the family of distributions you believe the outcome variable belongs to.
You can start by answering this question: is your outcome variable continuous or discrete?
Continuous outcome variable
The variable can take on any real number, positive or negative, including 0: Gaussian (aka normal) distribution.
There are very few truly Gaussian variables, although in some cases one can speak of “approximate” or “assumed” normality.
This family is fitted by default in lm(), lme4::lmer() and brms::brm().
The variable can take on any positive number only: Log-normal distribution.
Duration of segments, words, pauses, etc, are known to be log-normally distributed.
Measurements taken in Hz (like f0, formants, centre of gravity, …) could be considered to be log-normal.
There are other families that could potentially be used depending on the nature of the variable: the exponentially modified Gaussian (ex-Gaussian; commonly used for reaction times), the gamma, …
The variable can take on any number between 0 and 1, but not 0 nor 1: Beta distribution.
- Proportions fall into this category (for example proportion of voicing within closure), although 0 and 1 are not allowed in the beta distribution.
The variable can take on any number between 0 and 1, including 0, or both 0 and 1: zero-inflated beta or zero/one-inflated beta (ZOIB) distribution.
- If the proportion data includes many 0s and 1s, then this is the ideal distribution to use. ZOIB distributions are somewhat more difficult to fit than a simple beta distribution, so a common practice is to transform the data so that it doesn’t include 0s or 1s (this can be achieved with different techniques, some better than others); see the sketch below.
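To make these options concrete, here is a minimal sketch with brms (the data frame dat and its columns duration and prop_voi are hypothetical names):

```r
library(brms)

# Log-normal model for a strictly positive variable, like duration.
m_dur <- brm(duration ~ 1, family = lognormal(), data = dat)

# Beta model for a proportion that contains no 0s or 1s.
m_prop <- brm(prop_voi ~ 1, family = Beta(), data = dat)

# Zero/one-inflated beta (ZOIB) model if the data does contain 0s and 1s.
m_zoib <- brm(prop_voi ~ 1, family = zero_one_inflated_beta(), data = dat)

# A common (if imperfect) alternative to ZOIB: "squeeze" the proportion
# away from 0 and 1 (Smithson & Verkuilen 2006) and fit a plain beta model.
n <- nrow(dat)
dat$prop_sq <- (dat$prop_voi * (n - 1) + 0.5) / n
```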
Discrete outcome variable
The variable is dichotomous, i.e. it can take one of two levels: Bernoulli distribution.
Categorical outcome variables like yes/no, correct/incorrect, voiced/voiceless, follow this distribution.
This family is fitted by default when you run glm(family = binomial), aka “logistic regression” or “binomial regression”.
The variable is counts: Poisson distribution.
- Counts of words, segments, gestures, f0 peaks, …
The variable is a scale (for example, a 1–7 Likert rating): ordinal (cumulative) model. A sketch of all three cases follows below.
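Here is a minimal sketch of these three cases (again, dat and all column names are hypothetical):

```r
library(brms)

# Bernoulli for a dichotomous outcome, e.g. `correct` coded as 0/1.
# In base R this is glm() with family = binomial:
m_bern <- glm(correct ~ condition, family = binomial, data = dat)
# In brms:
# m_bern <- brm(correct ~ condition, family = bernoulli(), data = dat)

# Poisson for counts, e.g. the number of f0 peaks per utterance.
m_pois <- glm(n_peaks ~ condition, family = poisson, data = dat)

# Ordinal (cumulative) model for a rating scale, e.g. a 1-7 Likert item.
m_ord <- brm(rating ~ condition, family = cumulative(), data = dat)
```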
Step 2: Are there hierarchical groupings and/or repeated measures?
The second step is to ensure that, if the data is structured hierarchically or if repeated measures were taken, this is taken into account in the model. This is where so-called “random effects” or “group-level effects/terms” come in.
As an example, let’s assume you asked a number of participants to read a list of words and each word was repeated 5 times by each participant. You then took f0 measurements from the stressed vowel of each word, of each repetition.
Now, the data has a structure to it:
- First, observations are grouped by participant (some observations belong to one participant and others to another and so on).
- Second, observations are grouped by word (some observations belong to one word and others to another and so on).
- Third, within the observations of each word, some belong to the same participant (or, from a different perspective, within the observations of each participant, some belong to the same word).
The presence of “groupings” within the data (whether they come from natural groupings like participant or word, or from repeated measures) breaks one of the assumptions of linear models: that each observation must be independent.
If you don’t include any random effects/group-level terms, your model will assume that each observation is independent, and hence it will underestimate the variance and return unreliable results.
In the toy example of f0 measurements, you will want to include group-level terms for participant and word. These let the model know about the structure of the data described above.
If you have other predictors in the model, you should also add them as (random) slopes in the random effects/group-level terms. For example: (question | participant) + (question | word) (where question = statement vs question).
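Putting it all together for the f0 toy example, a minimal lme4 sketch might look like this (dat and all column names are hypothetical):

```r
library(lme4)

# f0 is logged here, following the log-normal advice from Step 1.
m <- lmer(
  log(f0) ~ question +            # population-level (fixed) effect
    (question | participant) +    # by-participant intercepts and slopes
    (question | word),            # by-word intercepts and slopes
  data = dat
)
summary(m)
```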
A quick terminological note: models that include random effects/group-level terms are called:
- Random-effects models.
- Mixed-effects models.
- Hierarchical models.
- Nested models.
- Multilevel models.
These terms are for all intents and purposes equivalent (it just happens that different traditions use different terms).
Step 3: Are there non-linear effects?
A typical use-case of non-linear terms is when you are dealing with time-series data or spatial data (i.e. geographic coordinates).
Generalised Additive Models (GAMs) allow you to fit non-linear effects using so-called “smooth” (or “smoother”) terms.
You can fit a linear model with smooth terms with brms (simply add smooth terms to the formula) or with mgcv (using gam() or bam()), among others.
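For instance, a minimal mgcv sketch for time-series data might look like this (dat and its columns are hypothetical names):

```r
library(mgcv)

m_gam <- bam(
  f0 ~ s(time) +                        # a smooth over time
    s(time, participant, bs = "fs"),    # by-participant factor smooths
  data = dat
)

# The equivalent smooth in a brms formula would be:
# brm(f0 ~ s(time), data = dat)
```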
Step 0-bis: Number of outcome variables
If you want to model just one outcome variable, you are already covered if you went through steps 1-3.
If instead your design has two or more outcome variables (for example F1 and F2, or the duration of the stressed and unstressed vowels of a word), then you want to fit a multivariate model (i.e. a model with multiple outcome variables).
The same steps we went through before can be applied to multiple outcome variables. In some cases, you will want to use the same model structure for all the outcome variables, while in others you might want to use a different model structure for each.
To learn more about multivariate models, I really recommend Paul Bürkner’s vignette Estimating Multivariate Models with brms.
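As a teaser, a minimal multivariate sketch with brms, along the lines of that vignette, might look like this (dat and all column names are hypothetical):

```r
library(brms)

bf_f1 <- bf(F1 ~ vowel + (1 | p | participant))
bf_f2 <- bf(F2 ~ vowel + (1 | p | participant))

# `| p |` tells brms to estimate the correlations between the two
# outcomes' group-level effects; set_rescor(TRUE) also estimates the
# residual correlation between F1 and F2.
m_mv <- brm(bf_f1 + bf_f2 + set_rescor(TRUE), data = dat)
```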
Footnotes
¹ Technically, the “simplest” linear model is \(y = f(x)\), but oh well…