In Chapter 18, you have learned the basics of probability and how to run Gaussian models to estimate the mean and standard deviation (\(\mu\) and \(\sigma\)) of a variable. This chapter extends the Gaussian model to what is commonly called a Gaussian regression model. Regression models (including the Gaussian) are models based on the equation of a straight line. They allow you to model the relationship between two or more variables. This textbook introduces you to regression models of varying complexity which can model variables frequently encountered in linguistics. Note that regression models are very powerful and flexible statistical models which can deal with a great variety of types of variables. [TODO: Appendix of types of regression models cheatsheet].
A regression model is a statistical model that estimates the relationship between an outcome variable and one or more predictor variables (more on outcome/predictor below). Regression models are based on the equation of a straight line.
\[ y = mx + c \]
An alternative notation of the equation is:
\[ y = \beta_0 + \beta_1 x \]
Where \(\beta_0\) is \(c\) and \(\beta_1\) is \(m\) in the first notation. We will use the second notation (with \(\beta_0\) and \(\beta_1\)) in this book, since using \(\beta\)’s with subscript indexes will help you understand the process of extracting information from regression models later.1
You might remember from school when you were asked to find the values of \(y\) given certain values of \(x\) and specific values of \(\beta_0\) and \(\beta_1\). For example, you were given the following formula (the dot \(\cdot\) stands for multiplication; it can be dropped so \(2 \cdot x\) and \(2x\) are equivalent):
\[ y = 3 + 2 \cdot x \]
and the values \(x = (2, 4, 5, 8, 10, 23, 36)\). The homework was to calculate the values of \(y\) and maybe plot them on a Cartesian coordinate space.
library(tidyverse)

line <- tibble(
  x = c(2, 4, 5, 8, 10, 23, 36),
  y = 3 + 2 * x
)

ggplot(line, aes(x, y)) +
  geom_point(size = 4) +
  geom_line(colour = "red") +
  labs(title = bquote(italic(y) == 3 + 2 * italic(x)))
Using the provided formula, we are able to find the values of \(y\). Note that in \(y = 3 + 2 \cdot x\), \(\beta_0 = 3\) and \(\beta_1 = 2\). Importantly, \(\beta_0\) is the value of \(y\) when \(x = 0\). \(\beta_0\) is commonly called the intercept of the line. The intercept is the value where the line crosses the y-axis (the value where the line “intercepts” the y-axis).
\[ \begin{align} y & = 3 + 2 \cdot x\\ & = 3 + 2 \cdot 0\\ & = 3\\ \end{align} \]
And \(\beta_1\) is the number that gets added to \(y\) for each unit increase of \(x\). \(\beta_1\) is commonly called the slope of the line.2
\[ \begin{align} y & = 3 + 2 \cdot 1 = 3 + 2 = 5\\ y & = 3 + 2 \cdot 2 = 3 + (2 + 2) = 7\\ y & = 3 + 2 \cdot 3 = 3 + (2 + 2 + 2) = 9\\ \end{align} \]
Figure 23.1 should clarify this. The dashed line indicates the increase in \(y\) for every unit increase of \(x\) (i.e., every time \(x\) increases by 1, \(y\) increases by 2).
line <- tibble(
  x = 0:3,
  y = 3 + 2 * x
)

ggplot(line, aes(x, y)) +
  geom_point(size = 4) +
  geom_line(colour = "red") +
  annotate("path", x = c(0, 0, 1), y = c(3, 5, 5), linetype = "dashed") +
  annotate("path", x = c(1, 1, 2), y = c(5, 7, 7), linetype = "dashed") +
  annotate("path", x = c(2, 2, 3), y = c(7, 9, 9), linetype = "dashed") +
  annotate("text", x = 0.25, y = 4.25, label = "+2") +
  annotate("text", x = 1.25, y = 6.25, label = "+2") +
  annotate("text", x = 2.25, y = 8.25, label = "+2") +
  scale_y_continuous(breaks = 0:15) +
  labs(title = bquote(italic(y) == 3 + 2 * italic(x)))
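The slope interpretation can also be checked numerically, without a plot: for consecutive integer values of \(x\), the first differences of \(y\) are constant and equal to \(\beta_1\). A quick base R check (separate from the figure code above):

```r
x <- 0:3
y <- 3 + 2 * x

# Increase in y for each unit increase in x: always the slope, 2
diff(y)
#> [1] 2 2 2
```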
Now, in the context of research, you usually start with a sample of measurements (values) of \(x\) (the predictor variable) and \(y\) (the outcome variable). Then you have to estimate (i.e. find the values of) \(\beta_0\) and \(\beta_1\) from the formula. This is what regression models are for: given the sampled values of \(y\) and \(x\), the model estimates \(\beta_0\) and \(\beta_1\).
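As a minimal sketch of this idea in base R: if the sampled values of \(y\) lie exactly on the line \(y = 3 + 2x\) (no error yet), `lm()` recovers \(\beta_0 = 3\) and \(\beta_1 = 2\) from the data alone.

```r
# Sampled values of x and the corresponding y, generated with y = 3 + 2x
x <- c(2, 4, 5, 8, 10, 23, 36)
y <- 3 + 2 * x

# lm() estimates beta_0 (the intercept) and beta_1 (the coefficient of x)
m <- lm(y ~ x)
coef(m)
#> (Intercept)           x
#>           3           2
```

The model is given only the paired values of `x` and `y`; the intercept and slope are estimated, not supplied.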
Measurements are noisy: they usually contain errors. Error can have many different causes (for example, measurement error due to technical limitations, or variability in human behaviour), but we are usually not that interested in learning about those causes. Rather, we just want our model to be able to deal with error. Let’s see what error looks like. Figure 23.2 shows values of \(y\) simulated with the equation \(y = 1 + 1.5x\) (with \(x\) from 1 to 10), to which the random error \(\epsilon\) was added. Due to the added error, the points are almost on the straight line defined by \(y = 1 + 1.5x\), but not quite. The vertical distance between the observed points and the expected line, called the regression line, is the residual error (red lines in the plot).
set.seed(4321)
x <- 1:10
y <- (1 + 1.5 * x) + rnorm(10, 0, 2)

line <- tibble(
  x = x,
  y = y
)

m <- lm(y ~ x)
yhat <- m$fitted.values
diff <- y - yhat

ggplot(line, aes(x, y)) +
  geom_segment(aes(x = x, xend = x, y = y, yend = yhat), colour = "red") +
  geom_point(size = 4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_continuous(breaks = 1:10) +
  labs(title = bquote(italic(y) == 1 + 1.5 * italic(x) + epsilon))
When taking into account error, the equation of a regression model becomes the following:
\[ y = \beta_0 + \beta_1 * x + \epsilon \]
where \(\epsilon\) is the error. In other words, \(y\) is the sum of \(\beta_0\), \(\beta_1 * x\) and some error. When \(y\) is assumed to be generated by a Gaussian distribution, the error \(\epsilon\) is assumed to come from a Gaussian distribution with mean 0 and standard deviation \(\sigma\): \(Gaussian(\mu = 0, \sigma)\).
\[ y = \beta_0 + \beta_1 * x + Gaussian(0, \sigma) \]
This equation can be rewritten like so (since the mean of the Gaussian error is 0):
\[ \begin{align} y & \sim Gaussian(\mu, \sigma)\\ \mu & = \beta_0 + \beta_1 * x\\ \end{align} \]
You can read those formulae like so: “The variable \(y\) is distributed according to a Gaussian distribution with mean \(\mu\) and standard deviation \(\sigma\). The mean \(\mu\) is equal to the intercept \(\beta_0\) plus the slope \(\beta_1\) times the variable \(x\).” This is a Gaussian regression model, because the assumed family of the outcome \(y\) is Gaussian. Now, the goal of a (Gaussian) regression model is to estimate \(\beta_0\), \(\beta_1\) and \(\sigma\) from the data (i.e. from the values of \(x\) and \(y\)). Of course, regression models are not limited to the Gaussian distribution family: in fact, they can be fit with virtually any other distribution family (the other most common families are Bernoulli, Poisson, beta and cumulative). In the following chapters, you will learn how to fit regression models to a variety of data to answer linguistic research questions.
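To sketch this full estimation process, the following simulates noisy data from \(y = 1 + 1.5x + \epsilon\) with \(\sigma = 2\) and fits a Gaussian regression with `lm()` (the seed and the sample size of 100 are arbitrary choices for illustration). `coef()` returns the estimates of \(\beta_0\) and \(\beta_1\), and `sigma()` returns the estimate of \(\sigma\).

```r
set.seed(1234)  # arbitrary seed, for reproducibility
x <- 1:100
y <- 1 + 1.5 * x + rnorm(100, mean = 0, sd = 2)  # add Gaussian error

# Fit a Gaussian regression: estimate beta_0, beta_1 and sigma from the data
m <- lm(y ~ x)

coef(m)   # estimates of beta_0 and beta_1 (close to the true 1 and 1.5)
sigma(m)  # estimate of sigma (close to the true 2)
```

The estimates will not match the true values exactly, because the model only sees the noisy sample; with more data, they tend to get closer.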