24  Regression models

In Chapter 23 you were introduced to regression models. Regression is a statistical model based on the equation of a straight line, with an added error term.

\[ y = \beta_0 + \beta_1 \cdot x + \epsilon \]

\(\beta_0\) is the regression line’s intercept and \(\beta_1\) is the slope of the line. We have seen that \(\epsilon\) is assumed to be from a Gaussian distribution with mean 0 and standard deviation \(\sigma\).

\[ \begin{align} y & \sim Gaussian(\mu, \sigma)\\ \mu & = \beta_0 + \beta_1 \cdot x\\ \end{align} \]

From now on, we will use the latter way of expressing regression models, because it makes it clear which distribution we assume the variable \(y\) to be generated by (here, a Gaussian distribution). Note that in the wild, variables very rarely are generated by Gaussian distributions. It is just pedagogically convenient to start with Gaussian regression models (i.e. regression models with a Gaussian distribution as the distribution of the outcome variable \(y\)) because the parameters of the Gaussian distribution, \(\mu\) and \(\sigma\) can be interpreted straightforwardly on the same scale as the outcome variable \(y\): so for example if \(y\) is in centimetres, then the mean and standard deviation are in centimetres, if \(y\) is in Hz, then the mean and SD are in Hz, and so on. Similarly, the regression \(\beta\) coefficients will be on the same scale as the outcome variable \(y\). You will be introduced later to regression models with distributions other than the Gaussian, where the regression parameters are estimated on a different scale than that of the outcome variable \(y\).

The goal of the Gaussian regression model expressed in the formulae above is to estimate \(\beta_0\), \(\beta_1\) and \(\sigma\) from observed data. Since truly Gaussian data is difficult to come by, especially in linguistics, for the sake of pedagogical simplicity we will start our journey into fitting regression models with data for which a Gaussian regression is, strictly speaking, not appropriate. In later chapters you will learn about distribution families that are better suited to this kind of data.

24.1 Vowel duration in Italian: the data

We will analyse the duration of vowels in Italian from Coretta (n.d.) and how speech rate affects vowel duration. The expectation we might have is that vowels get shorter with increasing speech rate. You will notice that this is a very vague hypothesis: how much shorter do they get? Is the shortening the same across all speech rates, or does it get weaker at higher speech rates? Our expectation/hypothesis simply states that vowels get shorter with increasing speech rate. We could perhaps do better and use what we know about speech production to come up with something more precise, but this type of vague hypothesis is very common, if not standard, in language research, so we will stick with it for practical and pedagogical reasons. Remember, however, that robust research should strive for precision.

Let’s load the R data file coretta2018a/ita_egg.rda. It contains several phonetic measurements obtained from audio and electroglottographic recordings. You can find the information on the data here: Electroglottographic data on Italian.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
load("data/coretta2018a/ita_egg.rda")

ita_egg

Let’s plot vowel duration and speech rate in a scatter plot. The relevant columns in the tibble are v1_duration and speech_rate. Figure 24.1 shows speech rate on the x-axis and vowel duration on the y-axis. The points in the plot are the individual observations (measurements) of vowels from the 19 speakers of Italian. The plot also includes a regression line, generated by geom_smooth(method = "lm"). By glancing at the individual points, we can see a negative relationship between speech rate and vowel duration: vowels get shorter with greater speech rate. This is reflected by the regression line, which has a negative slope.

ita_egg |> 
  ggplot(aes(speech_rate, v1_duration)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(
    x = "Speech rate (syllables per second)",
    y = "Vowel duration (ms)"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 24.1: Relationship between speech rate (as number of syllables per second) and vowel duration (in milliseconds) in 19 Italian speakers.

You might have noticed a warning about missing values. This is because some observations of vowel duration (v1_duration) in the data are missing (i.e. they are NA, “Not Available”). Let’s drop them from the tibble with drop_na(). We will use ita_egg_clean for the rest of the tutorial.

ita_egg_clean <- ita_egg |> 
  drop_na(v1_duration)
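
To check how many observations were dropped, you can compare the number of rows before and after (a quick sanity check):

# Number of rows dropped because v1_duration was NA.
nrow(ita_egg) - nrow(ita_egg_clean)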

24.1.1 The model

Let’s move on to fitting a Gaussian regression model with vowel duration as the outcome variable and speech rate as the predictor. We are assuming that vowel duration follows a Gaussian distribution (as mentioned above, this is not really the case, but it will do for now). Here is the model we will fit, in mathematical notation.

\[ \begin{align} \text{vdur} & \sim Gaussian(\mu, \sigma)\\ \mu & = \beta_0 + \beta_1 \cdot \text{sr}\\ \end{align} \]

  • Vowel duration (\(\text{vdur}\)) is distributed (\(\sim\)) according to a Gaussian distribution (\(Gaussian(\mu, \sigma)\)).

  • The mean \(\mu\) is equal to the sum of \(\beta_0\) (the intercept) and the product of \(\beta_1\) and speech rate (\(\beta_1 \cdot \text{sr}\)). The formula of \(\mu\) is the equation of a straight line, aka the linear equation. This is why regression models are also called linear models.

The regression model estimates the parameters in the mathematical formulae: parameters to be estimated in regression models are conventionally represented with Greek letters (hence our notation in the linear equation). Note that \(\mu\) is not estimated directly: since it is defined as a sum of terms containing parameters, the model estimates those parameters instead. So in total, the regression model represented in the formulae above has to estimate the following three parameters:

  • The regression coefficients \(\beta_0\) and \(\beta_1\).

  • The standard deviation of the Gaussian distribution, \(\sigma\).

To instruct R to model vowel duration as a function of the numeric predictor speech rate, you simply add it to the 1 we used on the right-hand side of the tilde in Chapter 22 (i.e. v1_duration ~ 1): so v1_duration ~ 1 + speech_rate. Now it should be clear what the 1 is for: it stands for the intercept, \(\beta_0\). You can also think about it as the \(1\) in the revised formula below:

\[ \begin{align} \text{vdur} & \sim Gaussian(\mu, \sigma)\\ \mu & = \beta_0 \cdot 1 + \beta_1 \cdot \text{sr}\\ \end{align} \]

Of course, \(\beta_0 \cdot 1\) is the same as \(\beta_0\). The R formula is based on what the coefficients are multiplied by in the mathematical formula: \(1\) and \(\text{sr}\). While the predictor \(\text{sr}\) can take different values, the \(1\) is constant, so it is also called the constant term, or the intercept term. In the R formula, you don’t need to explicitly include the coefficients \(\beta_0\) and \(\beta_1\). Put all this together and you get 1 + speech_rate. There is more: in R, since the \(1\) is constant, you can omit it! So v1_duration ~ 1 + speech_rate can also be written as v1_duration ~ speech_rate. They are equivalent.
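
If you want to see the constant term for yourself, you can inspect the design matrix that R builds from the formula. The following is a quick sketch using base R’s model.matrix():

# The "(Intercept)" column is a column of 1s: the constant term.
model.matrix(~ 1 + speech_rate, data = head(ita_egg_clean))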

Now that we have clarified how the R formula is set up, here is the full code to fit a Gaussian regression model of vowel duration with brms.

library(brms)

vow_bm <- brm(
  # `1 +` can be omitted.
  v1_duration ~ 1 + speech_rate,
  # v1_duration ~ speech_rate,
  family = gaussian,
  data = ita_egg_clean
)


24.2 Interpret the model summary

To obtain a summary of the model, use the summary() function.

summary(vow_bm)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: v1_duration ~ speech_rate 
   Data: ita_egg_clean (Number of observations: 3253) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     198.47      3.33   191.76   204.97 1.00     3681     2274
speech_rate   -21.73      0.62   -22.93   -20.49 1.00     3623     2421

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    21.66      0.27    21.14    22.19 1.00     3855     2615

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Let’s focus on the “Regression Coefficients” table of the summary. To understand what the coefficients are, just remember the equation of a line and the model formula above.

  • Intercept is \(\beta_0\): this is the mean vowel duration, when speech rate is 0.

  • speech_rate is \(\beta_1\): this is the change in vowel duration for each unit increase of speech rate.

This should make sense, if you understand the equation of a line: \(y = \beta_0 + \beta_1 x\). If you are still uncertain, play around with the web app.

Recall that the Estimate and Est.Error columns are simply the mean and standard deviation of the posterior probability distributions of Intercept and speech_rate, respectively.
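
You can verify this yourself by extracting the posterior draws and computing their mean and standard deviation. Here is a quick sketch for the speech_rate coefficient (in brms, the draws of the regression coefficients are named b_Intercept, b_speech_rate, and so on):

draws <- as_draws_df(vow_bm)
# These should match the Estimate and Est.Error of speech_rate in the summary.
mean(draws$b_speech_rate)
sd(draws$b_speech_rate)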

Looking at the 95% Credible Intervals (CrIs), we can say that based on the model and data:

  • The mean vowel duration, when speech rate is 0 syl/s, is between 192 and 205 ms, at 95% confidence.

  • We can be 95% confident that, for each unit increase of speech rate (i.e. for each increase of one syllable per second), the duration of the vowel decreases by 20.5-23 ms.
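
As a worked example, you can combine the two coefficients to obtain the predicted mean vowel duration at a given speech rate. The sketch below uses the posterior means from the summary above (your estimates might differ slightly from run to run):

b0 <- 198.47  # Intercept: mean vowel duration (ms) when speech rate is 0 syl/s.
b1 <- -21.73  # speech_rate: change in duration (ms) per one syl/s increase.
# Predicted mean vowel duration at 5 syllables per second: about 90 ms.
b0 + b1 * 5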

To see what the posterior probability densities of \(\beta_0\), \(\beta_1\) and \(\sigma\) look like, you can quickly plot them with the plot() function.

plot(vow_bm)

If you prefer to see density plots instead of histograms, you can specify the combo argument.

plot(vow_bm, combo = c("dens", "trace"))

24.3 Plot the model predictions

You should always also plot the model predictions, i.e. the predicted values of vowel duration based on the model predictors (here just speech_rate).

You will learn more advanced methods later on, but for now you can use conditional_effects() from the brms package.

conditional_effects(vow_bm, effects = "speech_rate")

If you wish to include the raw data in the plot, you can wrap conditional_effects() in plot() and specify points = TRUE. Any argument that needs to be passed to geom_point() (these are all ggplot2 plots!) can be specified in a list as the argument point_args. Here we are making the points transparent.

plot(
  conditional_effects(vow_bm, effects = "speech_rate"),
  points = TRUE,
  point_args = list(alpha = 0.1)
)

24.4 Reproducible model fit

Before you fit the model, create a folder called cache/ in the RStudio project folder. We will use this folder to save the output of model fitting so that you don’t have to refit the model every time. This is useful because, as models get more and more complex, they can take quite a while to fit.
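
One way to do this is with the file argument of brm(), which saves the fitted model to an .rds file and reads it back in on later runs; adding a seed also makes the sampling itself reproducible. The following is a sketch: the seed value is arbitrary and you can pick any file name you like.

vow_bm <- brm(
  v1_duration ~ speech_rate,
  family = gaussian,
  data = ita_egg_clean,
  # An arbitrary seed, to make the sampling reproducible across runs.
  seed = 9283,
  # Save the fit to cache/vow_bm.rds and reuse it when the code is re-run.
  file = "cache/vow_bm"
)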

The model will be fitted and saved in cache/ with the file name vow_bm.rds. If you re-run the same code, you will notice that brm() does not fit the model again, but rather reads it from the file (no output is shown, but trust me, it works! Check the contents of cache/ to see for yourself).

Important

When you save the model fit to a file, R does not keep track of changes in the model specification, so if you make changes to the formula or data, you need to delete the saved model file before re-running the code for the changes to take effect!
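
Depending on your version of brms, brm() also has a file_refit argument; setting it to "on_change" should make brm() refit and overwrite the cached file automatically when the formula or data change (this is an assumption about your installed version, so check ?brm to confirm it is available).

vow_bm <- brm(
  v1_duration ~ speech_rate,
  family = gaussian,
  data = ita_egg_clean,
  file = "cache/vow_bm",
  # Refit and overwrite the cached file if the formula or data have changed.
  file_refit = "on_change"
)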

24.5 Simulating Gaussian data

To make things a little bit more worldly, we will simulate data of human adult weight and height. We assume that height is distributed according to a Gaussian distribution with \(\mu = 165\) cm and \(\sigma = 8\) cm. Based on height, we simulate weight as \(0.4 \cdot \text{height}\) plus Gaussian error with mean 0 and standard deviation 2.

\[ \begin{align} w & \sim Gaussian(\mu, \sigma)\\ \mu & = 0.4 \cdot h\\ \sigma & = 2 \end{align} \]

library(tidyverse)

set.seed(62854)

h <- round(rnorm(200, 165, 8))
w <- 0.4 * h + rnorm(200, 0, 2)

weight <- tibble(h, w)

The code uses the rnorm() function which generates a random sample of values from a Gaussian distribution. The function takes three arguments:

  • The number of values to sample, here 200.
  • The mean of the Gaussian distribution, here 165.
  • The standard deviation of the Gaussian distribution, here 8.
Note

The name of the rnorm() function is composed of:

  • r for random.
  • norm for “normal”, the Gaussian distribution.

We use the round() function to round the values generated by rnorm() to the nearest integer.

The following plot shows the density curve of the simulated height data. The purple vertical line is the mean.

weight |> 
  ggplot(aes(h)) +
  geom_density(fill = "darkgreen", alpha = 0.2) +
  geom_rug() +
  geom_vline(aes(xintercept = mean(h)), colour = "purple", linewidth = 1)
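
To close the loop, you could fit the same kind of Gaussian regression to the simulated data and check whether the model recovers the values used in the simulation (the slope should come out close to 0.4 and sigma close to 2). This is just a sketch; the cache file name is a suggestion and the estimates will vary a little.

weight_bm <- brm(
  w ~ h,
  family = gaussian,
  data = weight,
  file = "cache/weight_bm"
)
summary(weight_bm)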

24.6 What’s next

In this chapter you have learned the very basics of Bayesian regression models. As mentioned above, regression models with brms are very flexible and you can easily fit very complex models with a variety of distribution families (for a list, see ?brmsfamily; you can even define your own distributions!).

The perk of using brms is that you can just learn the basics of one package and one approach and use it to fit a large variety of regression models.

This is different from the standard frequentist approach, where different models require different packages or functions, each with its own syntax and quirks.

In the following chapters, you will build your understanding of Bayesian regression models, which will enable you to approach even the most complex models! However, due to time constraints you won’t learn everything there is to learn.

Developing conceptual and practical skills in quantitative methods is a long-term process and unfortunately one semester will not be enough. So be prepared to continue your learning journey for years to come!