library(tidyverse)
mald <- readRDS("data/tucker2019/mald_1_1.rds")
mald
27 Regression models: categorical predictors
In Chapter 24 you learned how to fit regression models of the following form in R using the brms package.
\[ \begin{align} y & \sim Gaussian(\mu, \sigma)\\ \mu & = \beta_0 + \beta_1 \cdot x\\ \end{align} \]
In these models, \(x\) was a numeric predictor, like speech rate. Numeric predictors are not the only type of predictor that a regression model can handle. Regression predictors can also be categorical: gender, age group, place of articulation, mono- vs bi-lingual, etc. However, regression models cannot handle categorical predictors directly: think about it, what would it mean to multiply \(\beta_1\) by “female” or by “old”? Categorical predictors have to be re-coded as numbers.
In this chapter and the next we will revisit the MALD reaction times (RTs) data from Chapter 22, this time modelling RTs depending on the lexical status of the target word (real vs nonce word). This chapter will teach you the default way of including categorical predictors in regression models: numerical coding with treatment contrasts. In Chapter 28 you will learn about an alternative way: indexing.
27.1 Revisiting reaction times
Let’s read the MALD data (Tucker et al. 2019).
The relevant columns are RT, with the RTs in milliseconds, and IsWord, the type of target word: it tells us whether the target word is a real English word (TRUE) or not (a nonce word, FALSE). Figure 27.1 shows the density plot of RTs, grouped by whether the target word is real or not. We can notice that the distribution of RTs with nonce (non-real) words is somewhat shifted towards higher RTs, indicating that more time is needed to process nonce words than real words.
# Set the light theme for plots
theme_set(theme_light())
mald |>
  ggplot(aes(RT, fill = IsWord)) +
  geom_density(alpha = 0.8) +
  geom_rug(alpha = 0.1) +
  scale_fill_brewer(palette = "Dark2")

You might also notice that the “tails” of the distributions (the left and right sides) are not symmetric: the right tail is heavier than the left tail. This is a very common characteristic of RT values and of any variable that can only be positive (like phonetic durations). These variables are “bounded” to only positive numbers. You will learn later on that the values in these variables are generated by a log-normal distribution, rather than by a Gaussian distribution (which is “unbounded”). For the time being though, we will model the data as if they were generated by a Gaussian distribution, for pedagogical reasons.
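If you are curious about the asymmetry, a quick visual check (not part of this chapter’s analysis) is to plot the RTs on a logarithmic scale: if RTs are approximately log-normal, the log-transformed densities should look roughly symmetric. The following is just a sketch, reusing the mald data read in above.

# Quick sketch: density of log(RT) by IsWord. If RTs are roughly log-normal,
# these densities should look more symmetric than those in Figure 27.1.
mald |>
  ggplot(aes(log(RT), fill = IsWord)) +
  geom_density(alpha = 0.8) +
  geom_rug(alpha = 0.1) +
  scale_fill_brewer(palette = "Dark2")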
Another way to present a numeric variable like RTs depending on a categorical variable is to use a jitter plot, like Figure 27.2. A jitter plot places dots corresponding to the values in the data on “strips”. The strips are created by randomly jittering dots horizontally, so that they don’t all fall on a straight line. The width of the strips, aka the jitter, can be adjusted with the width argument. It’s subtle, but you can see how in the range 1 to 2 seconds there are a bit more dots for nonce words (right, orange) than for real words (left, green). In other words, the density of dots in that range is greater for nonce words than for real words. If you compare again the densities in Figure 27.1 above, you will notice that in the 1-2 second range the orange density (nonce words) is higher than the green one (real words). These are just two ways of visualising the same thing.
mald |>
  ggplot(aes(IsWord, RT, colour = IsWord)) +
  geom_jitter(alpha = 0.15, width = 0.1) +
  scale_colour_brewer(palette = "Dark2")

We can even overlay density plots on jitter plots using “violins”, like in Figure 27.3. The violins are simply mirrored density plots, placed vertically on top of the jittered strips. The width of the violins can be adjusted with the width argument, like with the jitter.
mald |>
  ggplot(aes(IsWord, RT, fill = IsWord)) +
  geom_jitter(alpha = 0.15, width = 0.1) +
  geom_violin(width = 0.2) +
  scale_fill_brewer(palette = "Dark2")

Now we can obtain a few summary measures. The following code groups the data by IsWord and then calculates the mean, median and standard deviation of the RTs.
mald_summ <- mald |>
  group_by(IsWord) |>
  summarise(
    mean(RT), median(RT), sd(RT)
  )
mald_summ
The mean and median RTs for nonce words (IsWord = FALSE) are about 100 ms higher than the mean and median RTs for real words. We could stop here and call it a day, but we would make the mistake of not considering uncertainty and variability: this is just one (admittedly large) sample of all the RT values that could be produced by the entire English-speaking population. So we should apply inference from the sample to the population to obtain an estimate of the difference in RTs that accounts for that uncertainty and variability. The inference method we are learning uses regression modelling. The next sections will teach you how to model the RTs with regression models in brms.
27.2 Treatment contrasts
We can model RTs with a Gaussian distribution (although, as mentioned above, this family of distributions is generally not appropriate for variables like RTs), with a mean \(\mu\) and a standard deviation \(\sigma\). This time though we want to model a different mean \(\mu\) depending on the word type in IsWord. Each observation (RT value) is either of a real word or of a nonce word. This is what the subscript \(i\) is for in the equation below; since there are 5000 observations, \(i = [1..5000]\):
\[ RT_i \sim Gaussian(\mu_i, \sigma) \]
The mean \(\mu_i\) will depend on the specific observation \(RT_i\). We will see why in a second.
Now, how do we make the model estimate a different mean depending on IsWord? There are several ways of setting this up. The default method is to use so-called treatment contrasts, which involve numerical coding of categorical predictors (the coding takes the same name as the contrasts, so the coding for treatment contrasts is treatment coding). Let’s work by example with IsWord. This is a categorical predictor with two levels: TRUE and FALSE. With treatment contrasts, one level is chosen as the reference level and the other is compared to the reference level. The reference level is automatically set to the first level of the predictor in alphabetical order. This would mean that FALSE would be the reference level, because “F” comes before “T”.
However, in the mald data, the column IsWord is a factor column and the levels have been ordered so that TRUE is the first level and FALSE the second. You can see this in the Environment panel: if you click on the arrow next to mald and look next to IsWord, it will say Factor w/ 2 levels "TRUE","FALSE". You can also check the order of the levels of a factor with the levels() function.
levels(mald$IsWord)
[1] "TRUE" "FALSE"
You can set the order of the levels with the factor() function. If you don’t specify the order of the levels, the alphabetical order will be used. For example:
fac <- tibble(
  fac = c("a", "a", "b", "b", "a"),
  fac_1 = factor(fac),
  fac_2 = factor(fac, levels = c("b", "a"))
)
levels(fac$fac_1)
[1] "a" "b"
levels(fac$fac_2)
[1] "b" "a"
We will use IsWord with the order TRUE and FALSE. This means that TRUE will be the reference level and the RTs when IsWord is FALSE will be compared to the RTs when IsWord is TRUE. In other words, the mean \(\mu_i\) now varies depending on the level of IsWord. Next we need to numerically code IsWord (i.e. use numbers to refer to the levels of the categorical predictor). With treatment contrasts, treatment coding is used to numerically code categorical predictors. Treatment coding uses indicator variables, which take the values 0 or 1. Let’s make an indicator variable \(w_i\) that says whether IsWord is TRUE, in which case \(w = 0\), or FALSE, in which case \(w = 1\). We only need one indicator variable because there are only two levels and they can be coded with a 0 and a 1 respectively. This way of setting up indicator variables is called dummy coding. See Table 27.2 for the correspondence between the predictor IsWord and the value of the indicator variable \(w\).
Table 27.2: Correspondence between the levels of IsWord and the indicator variable \(w\).

| IsWord | \(w\) |
|---|---|
| IsWord = TRUE | 0 |
| IsWord = FALSE | 1 |
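If you want to double-check the coding that R will use, you can inspect the contrast matrix of the factor and the model matrix generated by the regression formula. This is a small sketch, assuming mald has been read in as above; neither call is needed to fit the model.

# The contrast matrix of IsWord: TRUE (the reference level) is coded as 0,
# FALSE as 1.
contrasts(mald$IsWord)

# The first rows of the model matrix for RT ~ IsWord: a column of 1s for the
# intercept and the 0/1 indicator variable, named IsWordFALSE (the same name
# will reappear in the model summary below).
head(model.matrix(~ IsWord, data = mald))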
We can now write the equation of \(\mu_i\) as in the following:
\[ \begin{align} RT_i & \sim Gaussian(\mu_i, \sigma)\\ \mu_i & = \beta_0 + \beta_1 \cdot w_i\\ \end{align} \]
\(\beta_0\) and \(\beta_1\) are two “regression coefficients”, estimated from the data by the regression model. You can think of \(\beta_0\) as the model’s intercept and of \(\beta_1\) as the model’s slope. Can you figure out why?
Recall that a regression model with a numeric predictor, like the one we saw in Chapter 24, is basically the formula of a line. We modelled vowel duration as a function of speech rate. The intercept of the regression line estimated by the model indicates where the line crosses the y-axis when the x value is 0: in our case, that is the vowel duration when speech rate is 0. In the model above, with RTs as a function of word type, the numeric predictor is the indicator variable \(w\) (which stands for the categorical predictor IsWord). The formula of a line really works just with numbers, so for the formula to make sense categorical predictors have to be turned into numbers, and treatment contrasts are one such way of doing so.
Now, \(\beta_0\) is the model’s intercept because that is the mean RT when \(w\) is 0 (like with vowel duration when speech rate is 0; can you see the parallel?). And \(\beta_1\) is the model’s slope because \(\beta_1\) is the change in RT for a unit increase of \(w\), which is when \(w\) goes from 0 to 1. This in turn corresponds to the change in level in the categorical predictor. Do you see the connection? It’s a bit contorted, but once you get this it should explain several aspects of regression models with categorical predictors and treatment contrasts.
So \(\mu_i\) is the sum of \(\beta_0\) and \(\beta_1 \cdot w_i\). \(w_i\) is 0 (when the word is real) or 1 (when the word is a nonce word). That is why we write \(RT_i\): the subscript \(i\) allows us to pick the correct value of \(w_i\) for each specific observation. The result of plugging in the value of \(w_i\) is laid out in the following formulae.
\[ \begin{align} \mu_i & = \beta_0 + \beta_1 \cdot w_i\\ \mu_{\text{T}} & = \beta_0 + \beta_1 \cdot 0 = \beta_0\\ \mu_{\text{F}} & = \beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1 \end{align} \]
- When IsWord is TRUE, the mean RT is equal to \(\beta_0\).
- When IsWord is FALSE, the mean RT is equal to \(\beta_0 + \beta_1\).
If \(\beta_0\) is the mean RT when IsWord is TRUE, what is \(\beta_1\) by itself? Simple.
\[ \begin{align} \beta_1 & = \mu_\text{F} - \mu_\text{T}\\ & = (\beta_0 + \beta_1) - \beta_0 \end{align} \]
\(\beta_1\) is the difference between the mean RT when IsWord is FALSE and the mean RT when IsWord is TRUE. As mentioned above, with treatment contrasts you are comparing the second level to the first. So one regression coefficient will be the mean of the reference level, while the other coefficient will be the difference between the means of the two levels. I know this is not particularly user-friendly, but this is the default in R when using categorical predictors.
27.3 Model RTs by word type
Fitting a regression model with a categorical predictor like IsWord is as simple as including the predictor in the model formula, to the right of the tilde ~. From now on we will stop including 1 + since an intercept term is included by default.
library(brms)
rt_bm_1 <- brm(
  # equivalent: RT ~ 1 + IsWord
  RT ~ IsWord,
  family = gaussian,
  data = mald,
  seed = 6725,
  file = "cache/ch-regression-cat-rt_bm_1"
)
Treatment contrasts are applied by default: you do not need to create an indicator variable yourself or tell brms to use that coding (which is both a blessing and a curse). The following code chunk returns the summary of the model rt_bm_1.
summary(rt_bm_1)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: RT ~ IsWord
Data: mald (Number of observations: 5000)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 953.14 6.12 941.17 965.17 1.00 4590 2897
IsWordFALSE 116.34 8.80 98.74 134.14 1.00 4815 3235
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 312.47 3.11 306.44 318.60 1.00 4720 3456
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The first few lines should be familiar: they report information about the model, like the distribution family, the formula and so on. Then we have the Regression Coefficients table. The estimates of the coefficients and the 95% Credible Intervals (CrIs) are reported in Table 27.3.
fixef(rt_bm_1) |> knitr::kable()
|  | Estimate | Est.Error | Q2.5 | Q97.5 |
|---|---|---|---|---|
| Intercept | 953.1412 | 6.120423 | 941.16925 | 965.1697 |
| IsWordFALSE | 116.3413 | 8.800872 | 98.73942 | 134.1424 |
The Estimate column is the mean of the posterior distribution of each regression coefficient, and Est.Error is the standard deviation of that posterior distribution. Q2.5 and Q97.5 are the lower and upper limits of the 95% CrI of the posterior distribution of the regression coefficients. These are summary measures of (posterior) probability distributions. We can quickly plot these with mcmc_dens() from the bayesplot package:
library(bayesplot)
mcmc_dens(rt_bm_1, pars = vars(starts_with("b_")))

Posterior densities of the regression coefficients of rt_bm_1.

- b_Intercept or Intercept in the model summary is the mean RT when IsWord is TRUE. This is \(\beta_0\) in the model’s mathematical formula.
- b_IsWordFALSE or IsWordFALSE in the model summary is the difference in mean RT between nonce and real words. This is \(\beta_1\) in the model’s mathematical formula.
The posterior distributions of the coefficients \(\beta_0\) and \(\beta_1\) capture the uncertainty in the estimated values of those coefficients. The 95% CrIs indicate the range of values of the posterior that cover 95% of the area under the curve of the posterior, i.e. 95% of the density mass. This means that at 95% probability, the value of the coefficient is within that range.
- The 95% CrI of Intercept \(\beta_0\) is [941, 965] ms: this means that there is a 95% probability that the mean RT when IsWord is TRUE is between 941 and 965 ms.
- The 95% CrI of IsWordFALSE \(\beta_1\) is [99, 134] ms: this means that there is a 95% probability that the difference in mean RT between nonce and real words is between 99 and 134 ms.
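If you want to go beyond the summary table, you can work with the posterior draws directly. The following is a minimal sketch, assuming the fitted model rt_bm_1 from above and the posterior package (installed as a dependency of brms): it reconstructs the mean RT for nonce words as \(\beta_0 + \beta_1\), mirroring the formulae above, and summarises it with a 95% CrI.

library(posterior)

draws <- as_draws_df(rt_bm_1)
mu_true <- draws$b_Intercept                        # mean RT for real words (beta_0)
mu_false <- draws$b_Intercept + draws$b_IsWordFALSE # mean RT for nonce words (beta_0 + beta_1)

# 95% CrI of the mean RT for nonce words.
quantile(mu_false, probs = c(0.025, 0.975))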
So here we have our answer: at 95% confidence (another way of saying at 95% probability), based on the model and data, RTs for nonce words are 99 to 134 ms longer than RTs for real words. Now, is a difference of 99-134 ms linguistically meaningful? Statistics cannot help with that: only a well-developed linguistic and mathematical model of lexical retrieval and of the related neurocognitive processes allows us to make any statement regarding the linguistic relevance of a particular result. This applies to any statistical approach, whether frequentist or Bayesian. Within a frequentist framework, “statistical significance” is not “linguistic significance”. A difference between two groups can be statistically significant and not be linguistically meaningful, and vice versa.
After having obtained estimates from the model, always ask yourself: what do those estimates mean, based on our current understanding of the linguistic phenomenon we are investigating?