29  More than two levels

Chapter 27 and Chapter 28 showed you how to fit a regression model with a categorical predictor. We modelled reaction times as a function of word type (real or nonce) from the MALD dataset (Tucker et al. 2019). The categorical predictor IsWord (word type: real or nonce) was included with treatment contrasts in Chapter 27 and using indexing in Chapter 28: in the former case, the model’s intercept is the mean of the first level and the “slope” is the difference between the second level and the first; in the latter case, the model estimated the mean of the two levels of IsWord.

In this chapter we will look at new data, Voice Onset Time (VOT) of Mixean Basque (Egurtzegi and Carignan 2020), to illustrate a categorical predictor with three levels.

29.1 Mixean Basque VOT

The data egurtzegi2020/eu_vot.csv contains measurements of VOT from 10 speakers of Mixean Basque (Egurtzegi and Carignan 2020). Mixean Basque contrasts voiceless unaspirated, voiceless aspirated and voiced stops. Let’s read the data.

library(tidyverse)

eu_vot <- read_csv("data/egurtzegi2020/eu_vot.csv")
eu_vot

The VOT should increase from voiced to voiceless unaspirated to voiceless aspirated stops. We can use a Gaussian regression model to assess whether the data is compatible with our expectations. The eu_vot data has a voicing column that tells only if the stop is voiceless or voiced, but we need a column that further differentiates between unaspirated and aspirated voiceless stops. We can create a new column, phonation depending on the phone, using the case_when() function inside mutate().

case_when() works like an extended ifelse() function: while ifelse() is restricted to two conditions (i.e. when something is TRUE or FALSE), case when allows you to specify many conditions. The general syntax for the conditions in case_when() is condition ~ replacement where condition is a matching statement and replacement is the value that should be returned when there is a match. In the following code, we use case_when() to match specific phones in the phone column and based on that we return voiceless, voiced or aspirated. These values are saved in the new column phonation. We also convert the VOT values from seconds to milliseconds by multiply the VOT by 1000 in a new column VOT_ms.

eu_vot <- eu_vot |> 
  mutate(
    phonation = case_when(
      phone %in% c("p", "t", "k") ~ "voiceless",
      phone %in% c("b", "d", "g") ~ "voiced",
      phone %in% c("ph", "th", "kh") ~ "aspirated"
    ),
    # convert to milliseconds
    VOT_ms = VOT * 1000
  )

Figure 29.1 shows the densities of the VOT values for voiced, voiceless (unaspirated) and (voiceless) aspirated stops separately. Do the densities match our expectations about VOT?

Code
eu_vot |> 
  drop_na(phonation) |> 
  ggplot(aes(VOT_ms, fill = phonation)) +
  geom_density(alpha = 0.5)
Figure 29.1
Exercise 1

Recreate the following plot.

The fill legend is not really needed, since the x-axis already separates the different phonations types, but different fill colours can help with the overall legibility of the violins since they highlight the area covered by the violin shape.

To remove the legend, you should use the theme() function. Check the documentation of ?theme and search online for the argument value that hides the legend.

Have you tried legend.position in theme()?

eu_vot |> 
  drop_na(phonation) |> 
  ggplot(aes(phonation, VOT_ms, fill = phonation)) +
  geom_jitter(alpha = 0.1, width = 0.2) +
  geom_violin(alpha = 0.8, width = 0.2) +
  labs(x = "Phonation", y = "VOT (ms)") +
  theme(legend.position = "none")
Exercise 2

Calculate appropriate measures of central tendency and dispersion of VOT depending on the phonation type.

29.2 Treatment contrasts with three levels

Let’s proceed with modelling VOT. We will assume that VOT values follow a Gaussian distribution: as with reaction times, this is just a pedagogical step for you to get familiar with fitting models with categorical predictors, but other distribution families might be more appropriate. While for reaction times there are some recommended distributions (which you will learn about in XXX), there are really no recommendations for VOT, so a Gaussian distribution will have to do for now.

Before fitting the model, it is important to go through the model’s mathematical formula and to pay particular attention to how phonation type is coded using treatment contrasts. The phonation predictor has three levels: aspirated, voiced and voiceless. The order of the levels follows the alphabetical order. You will remember from Chapter 27 that the mean of the first level of a categorical predictor ends up being the intercept of the model while the difference of the second level relative to the first is the slope. With a third level, the model estimates another slope, which is the difference between the third level and the first. With treatment contrasts, the second and higher levels of a categorical predictor are compared (or contrasted) with the first level. With the default alphabetical order, this means that the intercept of the model will tell us the mean VOT of aspirated stops, and the mean of voiced and voiceless stops will be compared to that of aspirated stops.

An important aspect of treatment coding of categorical predictors that we haven’t discussed is the number of indicator variables needed: the number of indicator variables is always the number of the levels of the predictor minus one (\(N - 1\), where \(N\) is the number of levels). It follows that a predictor with three levels needs two indicator variables (\(N = 3\), \(3 - 1 = 2\)). This is illustrated in Table 29.1.

Table 29.1: Treatment contrasts coding of the categorical predictor phonation.
phonation \(ph_\text{VD}\) \(ph_\text{VL}\)
phonation = aspirated 0 0
phonation = voiced 1 0
phonation = voiceless 0 1

Now that we know how phonation is coded, we can look at the model formula.

\[ \begin{align} \text{VOT}_i & \sim Gaussian(\mu_i, \sigma)\\ \mu_i & = \beta_0 + \beta_1 \cdot ph_{\text{VD}[i]} + \beta_2 \cdot ph_{\text{VL}[i]}\\ \end{align} \]

The formula states that each observation of VOT come from a Gaussian distribution with a mean and standard deviation and that the mean depends on the value of the indicator variables \(ph_\text{VD}\) and \(ph_\text{VL}\): that is what the subscript \(i\) is for.

Exercise 3

Work out the formula of the mean VOT for each level of phonation by substituting the correct value for \(ph_\text{VD}\) and \(ph_\text{VL}\).

For example, for phonation = aspirated:

\[ \begin{align} \mu_i & = \beta_0 + \beta_1 \cdot ph_{\text{VD}[i]} + \beta_2 \cdot ph_{\text{VL}[i]}\\ & = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0\\ & = \beta_0 \end{align} \]

So the mean VOT with aspirated stops is \(\beta_0\).

The code below fits a Gaussian regression model, with VOT (in milliseconds) as the outcome variable and phonation type as the (categorical) predictor. Phonation type (phonation) is coded with treatment contrasts. Before fitting the model, answer the question in Quiz 1 below.

library(brms)

vot_bm <- brm(
  VOT_ms ~ phonation,
  family = gaussian,
  data = eu_vot,
  seed = 6725,
  file = "cache/ch-regression-more-vot_bm"
)
Quiz 1
How many regression coefficients will there be in the summary of the model below?

When you run the model you will see this message: Warning: Rows containing NAs were excluded from the model.. This is really nothing to worry about: it just warns you that rows that have NAs were dropped before fitting the model. Of course, you could also drop them yourself in the data and feed the filtered data to the model. This is probably a better practice because it gives you the opportunity to explicitly find out which rows have NAs (and why).

Quiz 1
Based on the density plots of VOT you made above, which of the following sets of expectations makes sense?

You can just check the summary of the model. Does the summary meet your expectations?

summary(vot_bm)

Carefully look through the Regression Coefficients of the model and make sure you understand what each raw corresponds to. It should be clear by now that while the first coefficient, the “intercept”, is the mean VOT with aspirated stops, the second and third coefficients are the difference between the mean VOT of aspirated stops and the mean VOT of voiced and voiceless stops respectively. This falls out of the formulae you worked out in Exercise 3.

The following paragraph shows you how you could write up the model and results in a paper.

We fitted a Bayesian regression model to Voice Onset Time (VOT) of Mixean Basque stops. We used a Gaussian distribution for the outcome and phonation (aspirated, voiced, voiceless) as the only predictor. Phonation was coded using the default treatment coding.

According to the model, the mean VOT of aspirated stops is between 30 and 36 ms, at 95% probability (\(\beta\) = 33, SD = 1.5). At 95% confidence, the VOT of voiced stops is 73-81 ms shorter than that of aspirated stops (\(\beta\) = -77, SD = 1.86), while the VOT of voiceless stops is 10-16 ms shorter than that of aspirated stops (\(\beta\) = -13, SD = 1.58).

29.3 Indexing with three levels

In Chapter 28, you learned about including categorical predictors in regression models using indexing, rather than the default treatment coding. With indexing, the regression coefficients indicate the predicted mean outcome at each level of the categorical predictor. There is nothing special about predictors with three levels: you get one coefficient per level, so three in total.

Here is the mathematical formula of the model.

\[ \begin{align} \text{VOT}_i & \sim Gaussian(\mu_i, \sigma)\\ \mu_i & = \beta_1 \cdot P_{\text{A}[i]} + \beta_2 \cdot P_{\text{VD}[i]} + \beta_3 \cdot P_{\text{VL}[i]}\\ \end{align} \] The coefficients \(\beta_1, \beta_2, \beta_3\) are the mean VOT for each level of phonation.

Exercise 4

Figure out the coding table for the levels of phonation like we did in Chapter 28 for IsWord.

To fit a model of VOT by phonation using indexing, simply use the 0 + syntax, like shown in the following code.

vot_bm_i <- brm(
  VOT_ms ~ 0 + phonation,
  family = gaussian,
  data = eu_vot,
  seed = 6725,
  file = "cache/ch-regression-more-vot_bm_i"
)

Run the model and look at the regression coefficients. Do they match your expectations? It is always good to also plot the predicted values.

conditional_effects(vot_bm_i)
Figure 29.2: Predicted VOT by consonant phonation in Mixean Basque based on vot_bm_i.
Exercise 5

Write a short report of the model. Include information on the model specification, including coding of categorical predictors, and the regression coefficients. Do these results suggest three different VOT categories?