23  Introduction to regression

In Chapter 18, you learned the basics of probability and how to run Gaussian models to estimate the mean and standard deviation (\(\mu\) and \(\sigma\)) of a variable. This chapter extends the Gaussian model to what is commonly called a Gaussian regression model. Regression models (including the Gaussian) are based on the equation of a straight line. They allow you to model the relationship between two or more variables. This textbook introduces you to regression models of varying complexity that can model variables frequently encountered in linguistics. Note that regression models are very powerful and flexible statistical models which can deal with a great variety of variable types.

23.1 A straight line

A regression model is a statistical model that estimates the relationship between an outcome variable and one or more predictor variables (more on outcome/predictor below). Regression models are based on the equation of a straight line.

\[ y = mx + c \]

An alternative notation of the equation is:

\[ y = \beta_0 + \beta_1 x \]

Where \(\beta_0\) is \(c\) and \(\beta_1\) is \(m\) in the first notation. We will use the second notation (with \(\beta_0\) and \(\beta_1\)) in this book, since using \(\beta\)’s with subscript indexes will help you understand the process of extracting information from regression models later.1

23.2 Back to school

You might remember from school when you were asked to find the values of \(y\) given certain values of \(x\) and specific values of \(\beta_0\) and \(\beta_1\). For example, you were given the following formula (the dot \(\cdot\) stands for multiplication; it can be dropped so \(2 \cdot x\) and \(2x\) are equivalent):

\[ y = 3 + 2 \cdot x \]

and the values \(x = (2, 4, 5, 8, 10, 23, 36)\). The homework was to calculate the values of \(y\) and perhaps plot them in a Cartesian coordinate space.

Code
library(tidyverse)

# Calculate y for each value of x using y = 3 + 2x.
line <- tibble(
  x = c(2, 4, 5, 8, 10, 23, 36),
  y = 3 + 2 * x
)

# Plot the (x, y) points and the straight line they fall on.
ggplot(line, aes(x, y)) +
  geom_point(size = 4) +
  geom_line(colour = "red") +
  labs(title = bquote(italic(y) == 3 + 2 * italic(x)))

Using the provided formula, we are able to find the values of \(y\). Note that in \(y = 3 + 2 * x\), \(\beta_0 = 3\) and \(\beta_1 = 2\). Importantly, \(\beta_0\) is the value of \(y\) when \(x = 0\). \(\beta_0\) is commonly called the intercept of the line. The intercept is the value where the line crosses the y-axis (the value where the line “intercepts” the y-axis).

\[ \begin{align} y & = 3 + 2 * x\\ & = 3 + 2 * 0\\ & = 3\\ \end{align} \]

And \(\beta_1\) is the number added to \(y\) for each unit increase of \(x\): starting from the intercept, \(y\) increases by \(\beta_1\) every time \(x\) increases by 1. \(\beta_1\) is commonly called the slope of the line.2

\[ \begin{align} y & = 3 + 2 * 1 = 3 + 2 = 5\\ y & = 3 + 2 * 2 = 3 + (2 + 2) = 7\\ y & = 3 + 2 * 3 = 3 + (2 + 2 + 2) = 9\\ \end{align} \]

Figure 23.1 should clarify this. The dashed line indicates the increase in \(y\) for every unit increase of \(x\) (i.e., every time \(x\) increases by 1, \(y\) increases by 2).

Code
line <- tibble(
  x = 0:3,
  y = 3 + 2 * x
)

ggplot(line, aes(x, y)) +
  geom_point(size = 4) +
  geom_line(colour = "red") +
  annotate("path", x = c(0, 0, 1), y = c(3, 5, 5), linetype = "dashed") +
  annotate("path", x = c(1, 1, 2), y = c(5, 7, 7), linetype = "dashed") +
  annotate("path", x = c(2, 2, 3), y = c(7, 9, 9), linetype = "dashed") +
  annotate("text", x = 0.25, y = 4.25, label = "+2") +
  annotate("text", x = 1.25, y = 6.25, label = "+2") +
  annotate("text", x = 2.25, y = 8.25, label = "+2") +
  scale_y_continuous(breaks = 0:15) +
  labs(title = bquote(italic(y) == 3 + 2 * italic(x)))
Figure 23.1: Illustration of the meaning of the slope: with a slope of 2, for each unit increase of \(x\), \(y\) increases by 2.

Now, in the context of research, you usually start with a sample of measures (values) of \(x\) (the predictor variable) and \(y\) (the outcome variable). Then you have to estimate (i.e. find the values of) \(\beta_0\) and \(\beta_1\) in the formula. This is what regression models are for: given the sampled values of \(y\) and \(x\), the model estimates \(\beta_0\) and \(\beta_1\).
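
To make this concrete, here is a minimal sketch of how such an estimation can be done in R with the base function lm(); it is shown here only for illustration, using the same \(x\) values as above and the \(y\) values they produce. Because these points lie exactly on the line, the estimates recover \(\beta_0 = 3\) and \(\beta_1 = 2\).

Code
# Illustrative sketch: estimate beta_0 and beta_1 from sampled x and y values.
# lm() fits a straight line to the data by least squares.
x <- c(2, 4, 5, 8, 10, 23, 36)
y <- 3 + 2 * x

m <- lm(y ~ x)
coef(m)
# The estimated intercept is 3 and the estimated slope is 2,
# because these points lie exactly on the line y = 3 + 2x.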

Exercise
  • Go to the web app Linear Models Illustrated.

  • In the first tab, “Continuous”, you will find instructions on the left and a plot on the right. The plot on the right is the plot resulting from the parameters specified to the left.

  • Play around with the intercept and slope parameters to see what happens to the line with different values of \(\beta_0\) and \(\beta_1\).

23.3 Add error

Measurements are noisy: they usually contain errors. Error can have many different causes (for example, measurement error due to technical limitations or variability in human behaviour), but we are usually not that interested in learning about those causes. Rather, we just want our model to be able to deal with error. Let’s see what error looks like. Figure 23.2 shows values of \(y\) simulated with the equation \(y = 1 + 1.5x\) (with \(x\) going from 1 to 10), to which random error \(\epsilon\) was added. Due to the added error, the points are almost on the straight line defined by \(y = 1 + 1.5x\), but not quite. The vertical distance between the observed points and the expected line, called the regression line, is the residual error (red lines in the plot).

Code
# Simulate y = 1 + 1.5x plus Gaussian error with mean 0 and standard deviation 2.
set.seed(4321)
x <- 1:10
y <- (1 + 1.5 * x) + rnorm(10, 0, 2)

line <- tibble(
  x = x,
  y = y
)

# Fit a linear model and extract the fitted values (the points on the regression line).
m <- lm(y ~ x)
yhat <- m$fitted.values

ggplot(line, aes(x, y)) +
  # The red segments are the residual errors: observed y minus fitted y.
  geom_segment(aes(x = x, xend = x, y = y, yend = yhat), colour = "red") +
  geom_point(size = 4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_continuous(breaks = 1:10) +
  labs(title = bquote(italic(y) == 1 + 1.5 * italic(x) + epsilon))
Figure 23.2: Illustration of residual error.

When taking into account error, the equation of a regression model becomes the following:

\[ y = \beta_0 + \beta_1 * x + \epsilon \]

where \(\epsilon\) is the error. In other words, \(y\) is the sum of \(\beta_0\), \(\beta_1 * x\) and some error. In regression modelling, when \(y\) is assumed to be generated by a Gaussian distribution, the error \(\epsilon\) is assumed to come from a Gaussian distribution with mean 0 and standard deviation \(\sigma\): \(Gaussian(\mu = 0, \sigma)\).

\[ y = \beta_0 + \beta_1 * x + Gaussian(0, \sigma) \]

This equation can be rewritten like so (since the mean of the Gaussian error is 0):

\[ \begin{align} y & \sim Gaussian(\mu, \sigma)\\ \mu & = \beta_0 + \beta_1 * x\\ \end{align} \]

You can read those formulae like so: “The variable \(y\) is distributed according to a Gaussian distribution with mean \(\mu\) and standard deviation \(\sigma\). The mean \(\mu\) is equal to the intercept \(\beta_0\) plus the slope \(\beta_1\) times the variable \(x\).” This is a Gaussian regression model, because the assumed family of the outcome \(y\) is Gaussian. Now, the goal of a (Gaussian) regression model is to estimate \(\beta_0\), \(\beta_1\) and \(\sigma\) from the data (i.e. from the values of \(x\) and \(y\)). Of course, regression models are not limited to the Gaussian distribution family: in fact, they can be fit with virtually any other distribution family (the other most common families are Bernoulli, Poisson, beta and cumulative). In the following chapters, you will learn how to fit regression models to a variety of data to answer linguistic research questions.
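
To see these formulae in action, here is a minimal sketch in R: values of \(y\) are simulated from a Gaussian distribution whose mean follows \(\beta_0 + \beta_1 x\), using arbitrary values (\(\beta_0 = 2\), \(\beta_1 = 0.75\), \(\sigma = 3\)) chosen just for illustration, and the three parameters are then estimated with the base R functions lm() and sigma().

Code
# Simulate data from y ~ Gaussian(mu, sigma), with mu = beta_0 + beta_1 * x.
# The parameter values below are arbitrary, chosen only for illustration.
set.seed(2025)
x <- 1:50
mu <- 2 + 0.75 * x                        # beta_0 = 2, beta_1 = 0.75
y <- rnorm(length(x), mean = mu, sd = 3)  # sigma = 3

# Fit a Gaussian regression and recover the parameters.
m <- lm(y ~ x)
coef(m)   # estimates of beta_0 (intercept) and beta_1 (slope)
sigma(m)  # estimate of the residual standard deviation sigma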

Galton and regression

The basic logic of regression models is attributed to Francis Galton (1822–1911). Galton studied the relationship between the heights of parents and their children (Galton 1980, 1886). He noticed that while tall parents tended to have tall children, the children’s heights were often closer to the average height of the population. This phenomenon, which he called “regression toward mediocrity” (now known as “regression to the mean”), showed that extreme values (e.g., very tall or very short parents) were less likely to be perfectly transmitted to the next generation.

The core idea of Galton’s framework can be expressed as a regression model:

\[ y = \beta_0 + \beta_1 \cdot x + \epsilon \]

where:

  • \(y\): the child’s height (outcome variable),
  • \(x\): the average of the parents’ heights (predictor variable),
  • \(\beta_0\): the intercept (the expected height of a child when the parents’ height is at the mean),
  • \(\beta_1\): the slope, representing the rate of change in the child’s height with respect to the parents’ height,
  • \(\epsilon\): the error term, accounting for random variability.

Galton found that the slope \(\beta_1\) was less than 1, meaning that the children’s heights were not as extreme as their parents’ heights. For example, if parents were one unit taller than the population mean, their children were on average only \(\beta_1 < 1\) units taller than the mean: a “regression” toward the population mean. The intercept \(\beta_0\) ensured the line passed through the mean of both parents’ and children’s heights.
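
A small simulation can illustrate the idea (the numbers below are made up; they are not Galton’s data): if children’s heights are pulled toward the population mean, the slope estimated by regressing child height on average parental height comes out below 1.

Code
# Hypothetical simulation of regression to the mean (not Galton's actual data).
set.seed(1886)
midparent <- rnorm(500, mean = 170, sd = 5)  # average parental height in cm
# Children's heights are pulled toward the population mean (true slope 0.65 < 1).
child <- 170 + 0.65 * (midparent - 170) + rnorm(500, mean = 0, sd = 4)

coef(lm(child ~ midparent))  # the estimated slope is below 1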

Galton, eugenics and racism

Galton is considered one of the founders of modern statistics and is widely recognized for his contributions to fields such as regression, correlation, and the study of heredity. However, his work is also deeply intertwined with controversial and now discredited views on race and eugenics. Galton coined the term eugenics in 1883, defining it as the “science of improving the genetic quality of the human population”. His goal was to encourage the reproduction of individuals he deemed “fit” and discourage that of those he considered “unfit”. He promoted selective breeding among humans, drawing inspiration from animal breeding practices.

Galton believed in a hierarchy of intelligence and ability among “races”, a belief that was common among many European intellectuals of his time. In works like Hereditary Genius (1869), he argued that intelligence and other traits were hereditary and that Europeans were superior to other racial groups. These conclusions were based on flawed assumptions and biased interpretations of data. His ideas contributed to the spread of pseudo-scientific racism, which attempted to justify inequality and colonialism.

Galton’s eugenic ideas were later used to justify discriminatory policies, including forced sterilization programs and racial segregation in various countries. While Galton himself did not directly advocate for many of the extreme measures implemented in the 20th century, his work laid the groundwork for such abuses. His promotion of eugenics and racial hierarchies has left a damaging legacy.


  1. Yet other notations are \(y = a + bx\) and \(y = \alpha + \beta x\).↩︎

  2. Mathematically, it is called the gradient, but in regression modelling the word slope is commonly used.↩︎