29 More than two levels

Chapter 27 and Chapter 28 showed you how to fit a regression model with a categorical predictor. We modelled reaction times as a function of word type (real or nonce) from the MALD dataset (Tucker et al. 2019). The categorical predictor IsWord (word type: real or nonce) was included with treatment contrasts in Chapter 27 and with indexing in Chapter 28: in the former case, the model's intercept is the mean of the first level and the "slope" is the difference between the second level and the first; in the latter case, the model estimated the mean of each of the two levels of IsWord.

In this chapter we will look at new data, Voice Onset Time (VOT) of Mixean Basque (Egurtzegi and Carignan 2020), to illustrate a categorical predictor with three levels.

29.1 Mixean Basque VOT

The data egurtzegi2020/eu_vot.csv contains measurements of VOT from 10 speakers of Mixean Basque (Egurtzegi and Carignan 2020). Mixean Basque contrasts voiceless unaspirated, voiceless aspirated and voiced stops. Let's read the data.

```r
library(tidyverse)

eu_vot <- read_csv("data/egurtzegi2020/eu_vot.csv")
```
VOT should increase from voiced to voiceless unaspirated to voiceless aspirated stops. We can use a Gaussian regression model to assess whether the data is compatible with our expectations. The eu_vot data has a voicing column that only tells us whether a stop is voiceless or voiced, but we need a column that further differentiates between unaspirated and aspirated voiceless stops. We can create a new column, phonation, based on phone, using the case_when() function inside mutate().
case_when() works like an extended ifelse() function: while ifelse() handles a single condition (returning one value when it is TRUE and another when it is FALSE), case_when() allows you to specify many conditions. The general syntax for each condition in case_when() is condition ~ replacement, where condition is a matching statement and replacement is the value that should be returned when there is a match. In the following code, we use case_when() to match specific phones in the phone column and, based on that, return voiceless, voiced or aspirated. These values are saved in the new column phonation. We also convert the VOT values from seconds to milliseconds by multiplying the VOT by 1000 in a new column, VOT_ms.
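As a minimal sketch of the difference (using a toy vector of phones, not the eu_vot data), compare nested ifelse() calls with a single case_when():

```r
library(dplyr)

phones <- c("p", "b", "ph")

# nested ifelse(): one extra call for every extra condition
with_ifelse <- ifelse(phones %in% c("p", "t", "k"), "voiceless",
                 ifelse(phones %in% c("b", "d", "g"), "voiced", "aspirated"))

# case_when(): each condition ~ replacement pair is listed once
with_case_when <- case_when(
  phones %in% c("p", "t", "k") ~ "voiceless",
  phones %in% c("b", "d", "g") ~ "voiced",
  phones %in% c("ph", "th", "kh") ~ "aspirated"
)

identical(with_ifelse, with_case_when)  # TRUE
```

With only three conditions the two are equivalent, but case_when() stays readable as conditions multiply.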
```r
eu_vot <- eu_vot |>
  mutate(
    phonation = case_when(
      phone %in% c("p", "t", "k") ~ "voiceless",
      phone %in% c("b", "d", "g") ~ "voiced",
      phone %in% c("ph", "th", "kh") ~ "aspirated"
    ),
    # convert to milliseconds
    VOT_ms = VOT * 1000
  )
```
Figure 29.1 shows the densities of the VOT values for voiced, voiceless (unaspirated) and (voiceless) aspirated stops separately. Do the densities match our expectations about VOT?

```r
eu_vot |>
  drop_na(phonation) |>
  ggplot(aes(VOT_ms, fill = phonation)) +
  geom_density(alpha = 0.5)
```

[Figure 29.1: Density curves of VOT (ms) by phonation type.]
29.2 Treatment contrasts with three levels
Let’s proceed with modelling VOT. We will assume that VOT values follow a Gaussian distribution: as with reaction times, this is just a pedagogical step for you to get familiar with fitting models with categorical predictors, but other distribution families might be more appropriate. While for reaction times there are some recommended distributions (which you will learn about in XXX), there are really no recommendations for VOT, so a Gaussian distribution will have to do for now.
Before fitting the model, it is important to go through the model's mathematical formula and to pay particular attention to how phonation type is coded using treatment contrasts. The phonation predictor has three levels: aspirated, voiced and voiceless. By default, the levels are ordered alphabetically. You will remember from Chapter 27 that the mean of the first level of a categorical predictor ends up being the intercept of the model, while the difference of the second level relative to the first is the slope. With a third level, the model estimates another slope: the difference between the third level and the first. With treatment contrasts, the second and higher levels of a categorical predictor are compared (or contrasted) with the first level. With the default alphabetical order, this means that the intercept of the model will tell us the mean VOT of aspirated stops, and the means of voiced and voiceless stops will be compared to that of aspirated stops.
An important aspect of treatment coding of categorical predictors that we haven’t discussed is the number of indicator variables needed: the number of indicator variables is always the number of the levels of the predictor minus one (\(N - 1\), where \(N\) is the number of levels). It follows that a predictor with three levels needs two indicator variables (\(N = 3\), \(3 - 1 = 2\)). This is illustrated in Table 29.1.
Table 29.1: Treatment coding of phonation.

| phonation | \(ph_\text{VD}\) | \(ph_\text{VL}\) |
|---|---|---|
| phonation = aspirated | 0 | 0 |
| phonation = voiced | 1 | 0 |
| phonation = voiceless | 0 | 1 |
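You can inspect these indicator variables directly in R: for a factor, contrasts() returns the treatment-coding matrix with \(N - 1\) columns. A sketch on a hand-made factor (the columns are named after the non-reference levels, rather than VD and VL):

```r
phonation <- factor(c("aspirated", "voiced", "voiceless"))

# default treatment contrasts: N - 1 = 2 indicator columns,
# with the first (alphabetically ordered) level as the reference
contrasts(phonation)
#>           voiced voiceless
#> aspirated      0         0
#> voiced         1         0
#> voiceless      0         1
```

The rows match Table 29.1: the reference level, aspirated, gets 0 in both columns.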
Now that we know how phonation is coded, we can look at the model formula.
\[
\begin{align}
\text{VOT}_i & \sim Gaussian(\mu_i, \sigma)\\
\mu_i & = \beta_0 + \beta_1 \cdot ph_{\text{VD}[i]} + \beta_2 \cdot ph_{\text{VL}[i]}\\
\end{align}
\]
The formula states that each observation of VOT comes from a Gaussian distribution with a mean and a standard deviation, and that the mean depends on the values of the indicator variables \(ph_\text{VD}\) and \(ph_\text{VL}\): that is what the subscript \(i\) is for.
The code below fits a Gaussian regression model, with VOT (in milliseconds) as the outcome variable and phonation type as the (categorical) predictor. Phonation type (phonation) is coded with treatment contrasts. Before fitting the model, answer the question in Quiz 1 below.
```r
library(brms)

vot_bm <- brm(
  VOT_ms ~ phonation,
  family = gaussian,
  data = eu_vot,
  seed = 6725,
  file = "cache/ch-regression-more-vot_bm"
)
```
When you run the model you will see this message: Warning: Rows containing NAs were excluded from the model. This is really nothing to worry about: it just warns you that rows containing NAs were dropped before fitting the model. Of course, you could also drop them yourself and feed the filtered data to the model. This is probably better practice, because it gives you the opportunity to find out explicitly which rows have NAs (and why).
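For example, you could filter with tidyr's drop_na() before modelling. A sketch on a small hypothetical data frame (not the actual eu_vot columns):

```r
library(tidyr)

# hypothetical mini data frame with a missing phonation value
d <- data.frame(
  VOT_ms = c(20, -44, 31),
  phonation = c("voiceless", "voiced", NA)
)

# drop rows with NA in the variables the model uses
d_clean <- drop_na(d, VOT_ms, phonation)
nrow(d_clean)  # 2
```

Inspecting d before and after the filter tells you exactly which observations the model will not see.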
Carefully look through the Regression Coefficients of the model and make sure you understand what each row corresponds to. It should be clear by now that while the first coefficient, the "intercept", is the mean VOT of aspirated stops, the second and third coefficients are the differences between the mean VOT of aspirated stops and the mean VOT of voiced and voiceless stops respectively. This falls out of the formulae you worked out in Exercise 3.
The following paragraph shows you how you could write up the model and results in a paper.
We fitted a Bayesian regression model to Voice Onset Time (VOT) of Mixean Basque stops. We used a Gaussian distribution for the outcome and phonation (aspirated, voiced, voiceless) as the only predictor. Phonation was coded using the default treatment coding.
According to the model, the mean VOT of aspirated stops is between 30 and 36 ms, at 95% probability (\(\beta\) = 33, SD = 1.5). At 95% probability, the VOT of voiced stops is 73-81 ms shorter than that of aspirated stops (\(\beta\) = -77, SD = 1.86), while the VOT of voiceless stops is 10-16 ms shorter than that of aspirated stops (\(\beta\) = -13, SD = 1.58).
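To see how the level means fall out of the treatment-coded coefficients, we can plug the rounded posterior means reported above into the model formula (a back-of-the-envelope sketch, not a substitute for working with the full posterior):

```r
b0 <- 33    # Intercept: mean VOT of aspirated stops (ms)
b1 <- -77   # difference: voiced minus aspirated
b2 <- -13   # difference: voiceless minus aspirated

# mean VOT for each phonation level, following
# mu = b0 + b1 * ph_VD + b2 * ph_VL
c(aspirated = b0, voiced = b0 + b1, voiceless = b0 + b2)
#> aspirated    voiced voiceless
#>        33       -44        20
```

Note that the mean VOT of voiced stops is negative, as expected for prevoiced stops, and that VOT increases from voiced to voiceless to aspirated, matching our expectations.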
29.3 Indexing with three levels
In Chapter 28, you learned about including categorical predictors in regression models using indexing, rather than the default treatment coding. With indexing, the regression coefficients indicate the predicted mean outcome at each level of the categorical predictor. There is nothing special about predictors with three levels: you get one coefficient per level, so three in total.
Here is the mathematical formula of the model.
\[
\begin{align}
\text{VOT}_i & \sim Gaussian(\mu_i, \sigma)\\
\mu_i & = \beta_1 \cdot P_{\text{A}[i]} + \beta_2 \cdot P_{\text{VD}[i]} + \beta_3 \cdot P_{\text{VL}[i]}\\
\end{align}
\]
The coefficients \(\beta_1, \beta_2, \beta_3\) are the mean VOT for each level of phonation.
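The indexing coding can be inspected with base R's model.matrix(): with 0 + in the formula, the intercept is removed and each level gets its own indicator column, matching \(P_\text{A}\), \(P_\text{VD}\) and \(P_\text{VL}\). A sketch on a hand-made factor:

```r
phonation <- factor(c("aspirated", "voiced", "voiceless"))

# 0 + removes the intercept, so all three levels get a column
mm <- model.matrix(~ 0 + phonation)
mm
```

Each row has a single 1, in the column of that observation's level.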
To fit a model of VOT by phonation using indexing, simply use the 0 + syntax, as shown in the following code.
```r
vot_bm_i <- brm(
  VOT_ms ~ 0 + phonation,
  family = gaussian,
  data = eu_vot,
  seed = 6725,
  file = "cache/ch-regression-more-vot_bm_i"
)
```
Run the model and look at the regression coefficients. Do they match your expectations? It is always good to also plot the predicted values.

```r
conditional_effects(vot_bm_i)
```

[Figure: conditional effects of phonation from the model vot_bm_i.]