So far you have been fitting regression models in which the outcome variable was numeric and continuous. However, a lot of studies use binary outcome variables, so it is important to learn how to deal with those. This is what this chapter is about.

Binary outcome variables are very common in linguistics. These are categorical variables that have two levels, e.g. correct/incorrect responses in a lexical decision task, or yes/no answers.
When modelling binary outcomes, what the researcher is usually interested in is the probability of obtaining one of the two levels. For example, in a lexical decision task one might want to know the probability that real words were recognised as such (in other words, we are interested in accuracy: incorrect or correct response). Let's say there is an 80% probability of responding correctly. So (\(p(\cdot)\) stands for "probability of"):
\[ \begin{align} p(\text{correct}) & = 0.8\\ p(\text{incorrect}) & = 1 - p(\text{correct}) = 0.2 \end{align} \]
You see that if you know the probability of one level (correct) you automatically know the probability of the other level, since there are only two levels and the total probability has to sum to 1. The distribution family for binary probabilities is the Bernoulli family. The Bernoulli family has only one parameter, \(p\), which is the probability of obtaining one of the two levels (one can pick which level). With our lexical decision task example, we can write:
\[ \begin{align} resp_{\text{correct}} & \sim Bernoulli(p) \\ p & = 0.8 \end{align} \]
You can read it as:
The probability of getting a correct response follows a Bernoulli distribution with \(p\) = 0.8.
If you randomly sampled from \(Bernoulli(0.8)\) you would get "correct" 80% of the time and "incorrect" 20% of the time. We can test this in R. In previous chapters we used the `rnorm()` function to generate random numbers from Gaussian distributions. R doesn't have an `rbern()` function, so we use the `rbinom()` function instead. This function generates random observations from a binomial distribution: the binomial distribution is a more general form of the Bernoulli distribution. It has two parameters: \(n\), the number of trials, and \(p\), the "success" probability of each trial. If we code the two levels of the binary variable as 0 and 1, \(p\) is the probability of getting 1 (that's why it is called the "success" probability).
\[ Binomial(n, p) \]
Think of a coin: you flip it 10 times, so \(n = 10\) (ten trials). If this is a fair coin, then \(p\) should be 0.5: 50% of the time you get heads (1) and 50% of the time you get tails (0). The `rbinom()` function takes three arguments: `n`, the number of observations (maybe confusingly, not the number of trials); `size`, the number of trials; and `prob`, the probability of success. The following code simulates 10 flips of a fair coin with `rbinom()`.
```r
# Set the seed for reproducibility
set.seed(9182)
rbinom(1, 10, 0.5)
#> [1] 6
```
The output is `6`, meaning that 6 out of 10 flips came up heads (1). Note that the probability of success \(p\) is the mean probability of success. In any one observation you won't necessarily get 5 out of 10 with \(p = 0.5\), but if you take many 10-trial observations, then on average you should get pretty close to 5 out of 10. Let's try this:
```r
# Set the seed for reproducibility
set.seed(9182)
mean(rbinom(100, 10, 0.5))
#> [1] 5.08
```
Here we took 100 observations of 10 flips each. On average, about 5 out of 10 flips came up heads (it is not precisely 5, but very close).
A Bernoulli distribution is simply a binomial distribution with a single trial. Imagine again a lexical decision task: each word presented to the participant is one trial, and in each trial there is a probability \(p\) of getting it right (correctly identifying the type of the word). We can thus use `rbinom()` with `size` set to 1. Let's get 25 observations of 1 trial each.
```r
# Set the seed for reproducibility
set.seed(9182)
rbinom(25, 1, 0.8)
#> [1] 1 0 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
```
For each trial, we get a `1` for correct or a `0` for incorrect. If you take the mean of the trials, it should be very close to 0.8 (with these random observations, it is 0.84). Again, \(p\) is the mean probability of success across trials.
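To check, we can re-run the same call with the same seed and take the mean of the sampled trials:

```r
# Same seed and call as above: the sample mean matches the 0.84 reported
set.seed(9182)
mean(rbinom(25, 1, 0.8))
#> [1] 0.84
```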
Now, what we are trying to do when modelling binary outcome variables is to estimate the probability \(p\) from the data. But there is a catch: probabilities are bounded between 0 and 1, and regression models don't work with bounded variables out of the box! Bounded probabilities are therefore transformed into an unbounded numeric variable. The following section explains how.
As we have just learned, probabilities are bounded between 0 and 1, but we need something that is not bounded, because regression models don't work with bounded numeric variables. This is where the logit function comes in: the logit function (from "logistic unit") is a mathematical function that transforms probabilities into log-odds. The logit function is the quantile function (the function that returns quantiles, i.e. the value below which a given proportion of a probability distribution lies) of the standard logistic distribution, i.e. the logistic distribution with location 0 and scale 1. In R, the logit function is `qlogis()`. The default location and scale in `qlogis()` are 0 and 1 respectively, so you can just input the first argument, `p`, which is the probability you want to transform into log-odds.
```r
qlogis(0.2)
#> [1] -1.386294
qlogis(0.5)
#> [1] 0
qlogis(0.8)
#> [1] 1.386294
```
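Under the hood, the logit is simply the natural logarithm of the odds \(p / (1 - p)\). A minimal check in base R (the helper name `logit` is ours, not a base R function):

```r
# The logit is the natural log of the odds p / (1 - p)
logit <- function(p) log(p / (1 - p))
logit(0.8)
#> [1] 1.386294
# Same result as qlogis(0.8)
```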
Figure 30.1 shows the logit function (the quantile function of the standard logistic distribution). The probabilities on the x-axis are transformed into log-odds on the y-axis. When you fit a regression model with a binary outcome and a Bernoulli family, the estimates of the regression coefficients are in log-odds.
```r
dots <- tibble(
  p = seq(0.1, 0.9, by = 0.1),
  log_odds = qlogis(p)
)

log_odds_p <- tibble(
  p = seq(0, 1, by = 0.001),
  log_odds = qlogis(p)
) %>%
  ggplot(aes(p, log_odds)) +
  geom_vline(xintercept = 0.5, linetype = "dashed") +
  geom_vline(xintercept = 0, colour = "#8856a7", linewidth = 1) +
  geom_vline(xintercept = 1, colour = "#8856a7", linewidth = 1) +
  geom_hline(yintercept = 0, alpha = 0.5) +
  geom_line(linewidth = 2) +
  geom_point(data = dots, size = 4) +
  geom_point(x = 0.5, y = 0, colour = "#8856a7", size = 4) +
  annotate("text", x = 0.2, y = 3, label = "logit(p) = log-odds") +
  scale_x_continuous(breaks = seq(0, 1, by = 0.1), minor_breaks = NULL, limits = c(0, 1)) +
  scale_y_continuous(breaks = seq(-6, 6, by = 1), minor_breaks = NULL) +
  labs(
    x = "Probability",
    y = "Log-odds"
  )

log_odds_p
```
To go back from log-odds to probabilities, we use the inverse logit function, which is the cumulative distribution function (CDF) of the standard logistic distribution. In R, you apply the inverse logit function (also called the logistic function) with `plogis()`. As with `qlogis()`, the default location and scale are 0 and 1 respectively, so you can just input the log-odds you want to transform into probabilities as the first argument, `q`.
```r
plogis(-3)
#> [1] 0.04742587
plogis(0)
#> [1] 0.5
plogis(2)
#> [1] 0.8807971
plogis(6)
#> [1] 0.9975274

# To show that plogis() is the inverse of qlogis()
plogis(qlogis(0.2))
#> [1] 0.2
```
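The inverse logit also has a simple closed form, \(1 / (1 + e^{-x})\). A quick base-R check (the helper name `inv_logit` is ours):

```r
# The inverse logit maps any real number to the interval (0, 1)
inv_logit <- function(x) 1 / (1 + exp(-x))
inv_logit(2)
#> [1] 0.8807971
# Same result as plogis(2)
```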
Figure 30.2 shows the inverse logit transformation of log-odds (on the x-axis) into probabilities (on the y-axis). The inverse logit traces the typical S-shaped curve (thick black line) of the CDF of the standard logistic distribution. Since probabilities can't be smaller than 0 or greater than 1, the curve flattens out in either direction, approaching 0 and 1 on the y-axis without ever reaching them (in mathematical terms, the curve is asymptotic). It is helpful to just memorise that log-odds 0 corresponds to probability 0.5 (and vice versa, of course).
```r
dots <- tibble(
  p = seq(0.1, 0.9, by = 0.1),
  log_odds = qlogis(p)
)

p_log_odds <- tibble(
  p = seq(0, 1, by = 0.001),
  log_odds = qlogis(p)
) %>%
  ggplot(aes(log_odds, p)) +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  geom_hline(yintercept = 0, colour = "#8856a7", linewidth = 1) +
  geom_hline(yintercept = 1, colour = "#8856a7", linewidth = 1) +
  geom_vline(xintercept = 0, alpha = 0.5) +
  geom_line(linewidth = 2) +
  # geom_point(data = dots, size = 4) +
  geom_point(x = 0, y = 0.5, colour = "#8856a7", size = 4) +
  annotate("text", x = -4, y = 0.8, label = "inv_logit(log-odds) = p") +
  scale_x_continuous(breaks = seq(-6, 6, by = 1), minor_breaks = NULL, limits = c(-6, 6)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1), minor_breaks = NULL) +
  labs(
    x = "Log-odds",
    y = "Probability"
  )

p_log_odds
```
To illustrate how to fit a Bernoulli model, we will use data from Brentari et al. (2024) on the emergent Nicaraguan Sign Language (Lengua de Señas Nicaragüense, NSL).
```r
verb_org <- read_csv("data/brentari2024/verb_org.csv")
verb_org
```

`verb_org` contains information on predicates as signed by three groups (`Group`): home-signers (`homesign`), first generation NSL signers (`NSL1`) and second generation NSL signers (`NSL2`). Specifically, the column `Num_Predicates` codes whether the predicates uttered by the signer were single-verb predicates (SVP, `single`) or multi-verb predicates (MVP, `multiple`). The hypothesis of the study is that the use of multi-verb predicates increases with each generation, i.e. that NSL1 signers would use more MVPs than home-signers and that NSL2 signers would use more MVPs than home-signers and NSL1 signers. (For the linguistic reasons behind this hypothesis, check the paper linked above.) Figure 30.3 shows the proportion of single vs multiple predicates in the three groups.
```r
verb_org |>
  ggplot(aes(Group, fill = Num_Predicates)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    y = "Proportion"
  )
```
What do you notice about the type of predicates in the three groups? Does it match the hypothesis put forward by the paper? We can calculate the proportion of multi-verb predicates by creating a column which codes `Num_Predicates` with `0` for single and `1` for multiple predicates. This is the same dummy coding we encountered in Chapter 27 for categorical predictors, but now we apply it to a binary outcome variable. Then you just take the mean of the dummy column by group to obtain the proportion of multi-verb predicates for each group. Note that since we code multi-verb predicates with `1`, taking the mean gives you the proportion of multi-verb predicates; the proportion of single predicates is just \(1 - p\), where \(p\) is the proportion of multi-verb predicates.
```r
verb_org <- verb_org |>
  mutate(
    Num_Pred_dum = ifelse(Num_Predicates == "multiple", 1, 0)
  )

verb_org |>
  group_by(Group) |>
  summarise(
    prop_multi = round(mean(Num_Pred_dum), 2)
  )
```
On average, home-signers produce 36% multi-verb predicates, first generation signers (NSL1) 19% and second generation signers (NSL2) 50%. In other words, there is a decrease in the production of multi-verb predicates from home-signers to NSL1 signers, while NSL2 signers use single and multi-verb predicates in roughly equal measure. It is often also useful to plot the proportion for each participant separately. Unfortunately, this is an area where bad practices have become standard, so it is worth spending some time on this in a new section.
A common (but incorrect) way of plotting proportion/percentage data (like accuracy, or the single vs multi-verb predicates of the `verb_org` dataset) is to calculate the proportion for each participant and then produce a bar chart with error bars that indicate the mean proportion (i.e. the mean of the participants' proportions) and the dispersion around that mean. You might have seen something like Figure 30.4 in many papers. This type of plot is called "bars and whiskers" because the error bars look like cat whiskers.
```r
verb_org |>
  group_by(Participant, Group) |>
  summarise(prop = sum(Num_Pred_dum) / n(), .groups = "drop") |>
  group_by(Group) |>
  add_tally() |>
  summarise(mean_prop = mean(prop), sd_prop = sd(prop)) |>
  ggplot(aes(Group, mean_prop)) +
  geom_bar(stat = "identity") +
  geom_errorbar(
    aes(
      ymin = mean_prop - sd_prop / sqrt(46),
      ymax = mean_prop + sd_prop / sqrt(46)
    ),
    width = 0.2
  )
```
THE HORROR. Alas, this is a very bad way of presenting proportion data. You can learn why in Holtz (2019).1 The alternative (robust) way to plot proportion data is to show the proportion for individual participants. To do so, we can use a combination of `summarise()`, `geom_jitter()` and `stat_summary()`. First, we need to compute the proportion of multi-verb predicates by participant. The procedure is the same as calculating the proportion for the three signer groups, but now we do it for each participant. Note that participants only have a unique ID within their group, so we need to group by participant and group (we need the group information anyway to be able to plot participants by group below). We save the output to a new tibble, `verb_part`.
```r
verb_part <- verb_org |>
  group_by(Participant, Group) |>
  summarise(
    prop_multi = round(mean(Num_Pred_dum), 2),
    .groups = "drop"
  )
```
Now we can plot the proportions and add means and confidence intervals using `geom_jitter()` and `stat_summary()`. Before proceeding, you need to install the Hmisc package. There is no need to attach it (it is used by `stat_summary()` under the hood). Remember not to include the installation code in your document; you need to install a package only once. After several years of teaching, I still see a lot of students with `install.packages()` in their scripts.
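If you haven't installed it yet, run this once in the console, not in your script:

```r
# Run once in the console; do not keep this line in your document
install.packages("Hmisc")
```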
```r
ggplot() +
  # Proportion of each participant
  geom_jitter(
    data = verb_part,
    aes(x = Group, y = prop_multi),
    width = 0.1, alpha = 0.5
  ) +
  # Mean proportion by group with confidence interval
  stat_summary(
    data = verb_org,
    aes(x = Group, y = Num_Pred_dum, colour = Group),
    fun.data = "mean_cl_boot", size = 0.5
  ) +
  labs(
    title = "Proportion of multi-verb predicates by participant and group",
    caption = "Mean proportion is represented by coloured points with 95% bootstrapped Confidence Intervals.",
    x = "Group",
    y = "Proportion"
  ) +
  ylim(0, 1) +
  theme(legend.position = "none")
```
In this data we don't have many participants per group, but you can immediately appreciate the variability between participants. You will notice something new in the code: we have specified the data inside `geom_jitter()` and `stat_summary()` instead of inside `ggplot()`. This is because the two functions need different data: `geom_jitter()` needs the data with the proportion we calculated for each participant and group; `stat_summary()` needs to calculate the mean and CIs from the overall data, rather than from the proportion data we created. This is the aspect that a lot of researchers get wrong: you should not take the mean of the participant proportions, unless all participants have exactly the same number of observations. In the `verb_org` data specifically, the number of observations does indeed differ across participants. You can check this yourself by getting the number of observations per participant per group with the `count()` function (a quick check is sketched after this paragraph). We are also specifying the aesthetics within each geom/stat function because, while `x` is the same, the `y` differs! In `stat_summary()`, the `fun.data` argument lets you specify the function to use for the summary statistics to be added. Here we are using the `mean_cl_boot` function, which returns the mean proportion of `Num_Pred_dum` and the 95% Confidence Intervals (CIs) of that mean. The CIs are calculated using a bootstrapping procedure (if you are interested in learning what that is, check the documentation of `smean.cl.boot` from the Hmisc package).
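A minimal sketch of that check:

```r
# Number of observations per participant within each group
verb_org |>
  count(Group, Participant)
```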
This also makes a nice opportunity to mention one shortcoming of the models we are fitting, thinking a bit ahead of ourselves: the regression models we cover in Weeks 4 to 10 do not account for the fact that the data comes from multiple participants; instead, they just assume that each observation in the data set is from a different participant. When you have multiple observations from each participant, we call this a repeated measures design. This is a problem because it breaks one assumption of regression models: that all the observations are independent of each other. Of course, if they come from the same participant, they are not independent. We won't have time in this course to learn about the solution (i.e. hierarchical regression models, also known as multi-level, mixed-effects, nested-effects, or random-effects models; these are just regression models that are set up so they can account for hierarchical grouping in the data, like multiple observations from multiple participants, or multiple pupils from multiple classes from multiple schools), but I point you to further resources in ?sec-next (TBA).
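For the curious, here is a minimal sketch of what a hierarchical version of the model we fit below could look like in brms, with a by-participant varying intercept. This is only an illustration (the object name `mvp_bm_h` is ours), not something we fit in this course:

```r
# Sketch only: (1 | Participant) adds a by-participant varying intercept,
# so the model knows that observations are grouped within participants
mvp_bm_h <- brm(
  Num_Predicates ~ 0 + Group + (1 | Participant),
  family = bernoulli,
  data = verb_org
)
```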
To statistically assess the hypothesis and obtain estimates of the proportion of multiple predicates, we can fit a Bernoulli model with `Num_Predicates` as the outcome variable and `Group` as a categorical predictor.

Before we move on to fitting the model, it is useful to transform `Num_Predicates` into a factor and specify the order of the levels, so that `single` is the first level and `multiple` is the second level.
This is useful because Bernoulli models estimate the probability (the parameter \(p\) in \(Bernoulli(p)\)) of getting the second level of the outcome variable.
You can also think of this in terms of `0`s and `1`s: the first level is assigned to `0` and the second level is assigned to `1`. Then a Bernoulli distribution with probability \(p\) tells you the probability of getting a `1`. It doesn't matter how you prefer to think about Bernoulli distributions, as long as you remember that the probability being estimated is the probability of the second level.
Now let's mutate `verb_org`.
```r
verb_org <- verb_org |>
  mutate(
    Num_Predicates = factor(Num_Predicates, levels = c("single", "multiple"))
  )
```
If you reproduce the plot above, you will now see that the order of `Num_Predicates` in the legend is "single" then "multiple" and that the order of the proportions in the bar chart has flipped.
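A quick way to double-check the level order (and thus which level the model will estimate the probability of):

```r
# The second level, "multiple", is the one whose probability is estimated
levels(verb_org$Num_Predicates)
#> [1] "single"   "multiple"
```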
Now we can move on to modelling.
\[ \begin{align} \text{Num\_Preds}_{MVP} & \sim Bernoulli(p_i) \\ logit(p_i) & = \alpha_{\text{Group}[i]} \\ \end{align} \]
The probability of using an MVP follows a Bernoulli distribution with probability \(p_i\).
The log-odds of \(p_i\) are equal to \(\alpha_{\text{Group}[i]}\), i.e. an \(\alpha\) for each Group.
In other words, the model estimates \(p\) for each group. Here is the code. Remember that to use the indexing approach for categorical predictors (`Group`) we need to suppress the intercept with the `0 +` syntax.
```r
mvp_bm <- brm(
  Num_Predicates ~ 0 + Group,
  family = bernoulli,
  data = verb_org,
  cores = 4,
  seed = 1329,
  file = "cache/ch-regression-bernoulli_mvp_bm"
)
```
Let’s inspect the model summary (we will get 80% CrIs).
```r
summary(mvp_bm, prob = 0.8)
#>  Family: bernoulli
#>   Links: mu = logit
#> Formula: Num_Predicates ~ 0 + Group
#>    Data: verb_org (Number of observations: 630)
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#>
#> Regression Coefficients:
#>               Estimate Est.Error l-80% CI u-80% CI Rhat Bulk_ESS Tail_ESS
#> Grouphomesign    -0.57      0.15    -0.76    -0.38 1.00     4424     3005
#> GroupNSL1        -1.46      0.17    -1.68    -1.24 1.00     3797     2995
#> GroupNSL2        -0.02      0.14    -0.21     0.16 1.00     4020     2851
#>
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
```
Based on the model, there is an 80% probability that the log-odds of an MVP are between -0.76 and -0.38 in home-signers, between -1.68 and -1.24 in NSL1 signers, and between -0.21 and 0.16 in NSL2 signers.
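To see how the conversion to probabilities works for a single coefficient, you can apply `plogis()` to the CrI bounds by hand; for home-signers:

```r
# Convert the home-signers' 80% CrI from log-odds to probabilities
round(plogis(c(-0.76, -0.38)), 2)
#> [1] 0.32 0.41
```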
It's easier to understand the results if we convert the log-odds to probabilities. The quickest way to do this for the whole table is to get the `Regression Coefficients` table from the summary with `fixef()` and mutate the `Q` columns with `plogis()`.
```r
fixef(mvp_bm, prob = c(0.1, 0.9)) |>
  # we need to convert the output of fixef() to a tibble to use mutate()
  as_tibble() |>
  # we plogis() the Q columns and round to the second digit
  mutate(
    Q10 = round(plogis(Q10), 2),
    Q90 = round(plogis(Q90), 2)
  )
```
Based on the model, there is an 80% probability that the probability of using an MVP is between 32% and 41% in home-signers, between 16% and 22% in NSL1 signers, and between 45% and 54% in NSL2 signers.
We can now see more clearly that the hypothesis of the study is not fully borne out by the data: while NSL2 signers are more likely to use an MVP than home-signers and NSL1 signers, it is not the case that NSL1 signers are more likely to use MVPs than home-signers.
To conclude this introduction to Bernoulli models (aka binomial/logistic regression), we can get the predicted probabilities of MVP use in the three groups with `conditional_effects()`.

```r
conditional_effects(mvp_bm)
```
This StackExchange answer is also useful: https://stats.stackexchange.com/a/367889/128897.↩︎