03 - Regression models

Stefano Coretta

Vowel duration

# A tibble: 887 × 57
   index speaker file     rec_date        ipu   prompt word    time sentence_ons
   <dbl> <chr>   <chr>    <chr>           <chr> <chr>  <chr>  <dbl>        <dbl>
 1     1 it01    it01-001 29/11/2016 15:… ipu_1 Dico … pugu   0.990        0.990
 2     2 it01    it01-002 29/11/2016 15:… ipu_2 Dico … pada   3.62         0.502
 3     3 it01    it01-003 29/11/2016 15:… ipu_3 Dico … poco   6.13         0.697
 4     4 it01    it01-004 29/11/2016 15:… ipu_4 Dico … pata   8.82         0.623
 5     5 it01    it01-005 29/11/2016 15:… ipu_5 Dico … boco  11.5          0.665
 6     6 it01    it01-006 29/11/2016 15:… ipu_6 Dico … podo  14.3          0.647
 7     7 it01    it01-007 29/11/2016 15:… ipu_7 Dico … boto  17.2          0.740
 8     8 it01    it01-008 29/11/2016 15:… ipu_8 Dico … paca  19.7          0.502
 9     9 it01    it01-009 29/11/2016 15:… ipu_9 Dico … bodo  22.3          0.556
10    10 it01    it01-010 29/11/2016 15:… ipu_… Dico … pucu  24.8          0.535
# ℹ 877 more rows
# ℹ 48 more variables: sentence_off <dbl>, word_ons <dbl>, word_off <dbl>,
#   v1_ons <dbl>, c2_ons <dbl>, v2_ons <dbl>, c1_rel <dbl>, c2_rel <dbl>,
#   voicing_start <dbl>, voicing_end <dbl>, voicing_duration <dbl>,
#   voiced_points <dbl>, GONS <dbl>, max <dbl>, NOFF <dbl>, NONS <dbl>,
#   peak1 <dbl>, peak2 <dbl>, c1_duration <dbl>, c1_clos_duration <dbl>,
#   c1_vot <dbl>, c1_rvoff <dbl>, v1_duration <dbl>, c2_duration <dbl>, …

Vowel durations: plot

Figure 1: Vowel duration and speech rate.

A log-normal model of vowel duration

\[ dur \sim LogNormal(\mu, \sigma) \]

dur_logn <- brm(
  v1_duration ~ 1,
  family = lognormal,
  data = durations
)

But we want to investigate the relationship between speech rate and vowel duration!

Allow \(\mu\) to vary depending on speech rate

\[ \begin{align} dur_i & \sim LogNormal(\mu_i, \sigma)\\ \mu_i & = \beta_0 + \beta_1 \cdot SR_i\\ \end{align} \]

Does the formula for \(\mu\) ring a bell?

The formula of a line

\[ y = a + b \cdot x \]

\(a\) is the line’s intercept. This is \(y\) when \(x\) is 0.
\(b\) is the line’s slope (aka gradient). This is the change in \(y\) for every unit increase of \(x\).

See Linear models illustrated.

Regression model of vowel duration

\[ \begin{align} dur_i & \sim LogNormal(\mu_i, \sigma)\\ \mu_i & = \beta_0 + \beta_1 \cdot SR_i\\ \end{align} \]

\(\beta_0\) is the intercept. This is the mean vowel duration when speech rate is 0.
\(\beta_1\) is the slope. This is the change in vowel duration for each unit increase of speech rate (syl/s).

But…

Speech rate 0 doesn’t make sense!

Speech rate cannot be zero syllables per second.

We can centre speech rate.

Subtract the mean speech rate from all the speech rate values.

Centred speech rate

mean(durations$speech_rate, na.rm = TRUE)

[1] 5.314752

durations <- durations |> 
  mutate(
    speech_rate_c = speech_rate - mean(speech_rate, na.rm = TRUE)
  )

speach_rate_c = 0 means mean speech rate.

Centred speech rate: plot

Figure 2: Vowel duration and (centred) speech rate.

Regression model of vowel durations: centred speech rate

\[ \begin{align} dur_i & \sim LogNormal(\mu_i, \sigma)\\ \mu_i & = \beta_0 + \beta_1 \cdot SR_{ctr[i]}\\ \end{align} \]

\(\beta_0\) is the intercept. This is the mean RT when centred speech rate is 0 (i.e. when speech rate is at its mean = NA).
\(\beta_1\) is the slope. This is the change in RT for each unit increase of centred speech rate (i.e for every unit increase of speech rate).

Regression model of vowel durations: code

dur_sr <- brm(
  v1_duration ~ 1 + speech_rate_c,
  family = lognormal,
  data = durations,
  cores = 4,
  seed = 1032,
  file = "data/cache/dur_sr"
)

Regression model: summary

summary(dur_sr)

 Family: lognormal 
  Links: mu = identity; sigma = identity 
Formula: v1_duration ~ 1 + speech_rate_c 
   Data: durations (Number of observations: 886) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         4.72      0.01     4.71     4.74 1.00     4388     2452
speech_rate_c    -0.23      0.01    -0.25    -0.22 1.00     4482     2982

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.22      0.01     0.21     0.23 1.00     3913     2983

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Interpreting the summary

\[ \begin{align} dur_i & \sim LogNormal(\mu_i, \sigma)\\ \mu_i & = \beta_0 + \beta_1 \cdot SR_{ctr[i]}\\ \end{align} \]

fixef(dur_sr)

                Estimate   Est.Error       Q2.5      Q97.5
Intercept      4.7230357 0.007580906  4.7081901  4.7379128
speech_rate_c -0.2344362 0.009378120 -0.2525304 -0.2157794

Intercept is \(\beta_0\): mean duration when speech rate is at mean.
speech_rate_c is \(\beta_1\): change in duration for each unit increase of speech rate.

Interpreting the summary: `Intercept`

fixef(dur_sr)

                Estimate   Est.Error       Q2.5      Q97.5
Intercept      4.7230357 0.007580906  4.7081901  4.7379128
speech_rate_c -0.2344362 0.009378120 -0.2525304 -0.2157794

The mean logged vowel duration is on average 4.72 (SD = 0.008).
There is a 95% probability that the mean logged vowel duration is between 4.71 and 4.74.

Interpreting the summary: `speech_rate_c`

fixef(dur_sr)

                Estimate   Est.Error       Q2.5      Q97.5
Intercept      4.7230357 0.007580906  4.7081901  4.7379128
speech_rate_c -0.2344362 0.009378120 -0.2525304 -0.2157794

The average change in logged vowel duration for each unit increase of speech rate is -0.23 (SD = 0.01).
In other words, for each increase of one syllable per second, the logged vowel duration decreases on average by -0.23.
We can be 95% confident that the decrease in logged duration is between -0.22 and -0.25.

Plotting the model predictions

conditional_effects(dur_sr)

Figure 3

Posterior predictive checks

pp_check(dur_sr, ndraws = 20)

Figure 4: Posterior predictive check plot of dur_sr.

Summary

Regression models are models that use the formula of a line.
A simple regression model with one numeric predictor estimates the line’s intercept (\(beta_0\)) and slope (\(beta_1\)) and the overall standard deviation (\(sigma\)).

\[ \begin{align} y_i & \sim LogNormal(\mu_i, \sigma)\\ \mu_i & = \beta_0 + \beta_1 \cdot x \end{align} \]