Introduction to Bayesian Linear Models

01 - Basics

Stefano Coretta

University of Edinburgh

Download the workshop materials

Schedule

10:00-10:50 (50’)  Session 1
10:50-11:00 (10’)  break
11:00-12:00 (60’)  Session 2
12:00-13:00 (60’)  LUNCH
13:00-14:00 (60’)  Session 3
14:00-14:15 (15’)  break
14:15-16:00 (105’) Session 4
16:00-16:45        OPTIONAL consultation time

One model to rule them all.
One model to find them.
One model to bring them all
And in the darkness bind them.

(No, I couldn’t translate ‘model’ into Black Speech, alas…)

Now enter The Linear Model

Now enter The Bayesian Linear Model

All about the Bayes

Within the NHST (frequentist) framework, the main outputs of an analysis are:

  • Point estimates of the predictors’ parameters (with standard errors).
  • P-values. 😈

Within the Bayesian framework, the main analysis output is:

  • Probability distributions of predictors’ parameters.

An example: MALD

Massive Auditory Lexical Decision (MALD) data (Tucker et al. 2019):

  • Auditory Lexical Decision task with real and nonce English words.
  • Reaction times and accuracy.
  • 521 subjects in total; here, a subset of 30 subjects with 100 observations each.
mald
# A tibble: 3,000 × 7
   Subject Item         IsWord PhonLev    RT ACC       RT_log
   <chr>   <chr>        <fct>    <dbl> <int> <fct>      <dbl>
 1 15345   nihnaxr      FALSE     5.84   945 correct     6.85
 2 15345   skaep        FALSE     6.03  1046 incorrect   6.95
 3 15345   grandparents TRUE     10.3    797 correct     6.68
 4 15345   sehs         FALSE     5.88  2134 correct     7.67
 5 15345   cousin       TRUE      5.78   597 correct     6.39
 6 15345   blowup       TRUE      6.03   716 correct     6.57
 7 15345   hhehrnzmaxn  FALSE     7.30  1985 correct     7.59
 8 15345   mantic       TRUE      6.21  1591 correct     7.37
 9 15345   notable      TRUE      6.82   620 correct     6.43
10 15345   prowthihviht FALSE     7.68  1205 correct     7.09
# ℹ 2,990 more rows

MALD: phone-level distance and lexical status

Let’s investigate the effects of:

  • PhonLev: mean phone-level Levenshtein distance.
  • IsWord: lexical status (real word vs nonce word).

On:

  • RT: reaction times.
  • ACC: accuracy.

MALD: phone-level distance and lexical status

mald |> 
  ggplot(aes(PhonLev, RT)) +
  geom_point(alpha = 0.1) +
  geom_smooth(aes(colour = IsWord, fill = IsWord), method = "lm", formula = y ~ x) +
  labs(
    x = "Phone-level distance", y = "RT (ms)"
  )

A frequentist linear model

lm_1 <- lmer(
  RT ~
    PhonLev +
    IsWord +
    PhonLev:IsWord +
    (1 | Subject),
  data = mald
)

A frequentist linear model: summary

summary(lm_1)
Linear mixed model fit by REML ['lmerMod']
Formula: RT ~ PhonLev + IsWord + PhonLev:IsWord + (1 | Subject)
   Data: mald

REML criterion at convergence: 43232.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.5511 -0.5985 -0.2439  0.3185  5.6906 

Random effects:
 Groups   Name        Variance Std.Dev.
 Subject  (Intercept)  11032   105.0   
 Residual             104648   323.5   
Number of obs: 3000, groups:  Subject, 30

Fixed effects:
                    Estimate Std. Error t value
(Intercept)          754.965     49.687  15.195
PhonLev               32.148      6.389   5.032
IsWordFALSE          212.440     65.453   3.246
PhonLev:IsWordFALSE  -11.620      9.076  -1.280

Correlation of Fixed Effects:
            (Intr) PhonLv IWFALS
PhonLev     -0.907              
IsWordFALSE -0.647  0.690       
PhL:IWFALSE  0.640 -0.705 -0.983

A frequentist linear model: plot predictions

ggpredict(lm_1, terms = c("PhonLev", "IsWord")) %>%
  plot()

Increase model complexity

Try it yourself! Does it work?

lm_2 <- lmer(
  RT ~
    PhonLev +
    IsWord +
    PhonLev:IsWord +
    (PhonLev + IsWord | Subject),
  data = mald
)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0430116 (tol = 0.002, component 1)
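
As an aside, before giving up on the frequentist fit you can check whether the random-effects structure is degenerate and try a different optimiser. A minimal sketch with lme4 (bobyqa is just one commonly tried optimiser, not a guaranteed fix):

# Check whether the fit is singular (variances/correlations at the boundary)
isSingular(lm_2)

# Refit with a different optimiser via lmerControl()
lm_2_bobyqa <- lmer(
  RT ~ PhonLev + IsWord + PhonLev:IsWord + (PhonLev + IsWord | Subject),
  data = mald,
  control = lmerControl(optimizer = "bobyqa")
)

Even when such tweaks silence the warning, complex random-effects structures often remain hard to estimate with lmer(), which motivates the Bayesian approach below.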

Let’s go Bayesian!

…like the Bayesian Rats!

A Bayesian linear model

brm_1 <- brm(
  RT ~
    PhonLev +
    IsWord +
    PhonLev:IsWord +
    (PhonLev + IsWord | Subject),
  data = mald,
  family = gaussian()
)

A Bayesian linear model: summary

brm_1
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ PhonLev + IsWord + PhonLev:IsWord + (PhonLev + IsWord | Subject) 
   Data: mald (Number of observations: 3000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~Subject (Number of levels: 30) 
                           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)                 98.77     33.49    44.25   177.00 1.01     1156
sd(PhonLev)                    7.76      5.03     0.47    19.03 1.01      465
sd(IsWordFALSE)               97.43     19.11    62.96   139.07 1.00     1804
cor(Intercept,PhonLev)        -0.42      0.44    -0.93     0.68 1.00     1522
cor(Intercept,IsWordFALSE)     0.53      0.27    -0.05     0.94 1.01      589
cor(PhonLev,IsWordFALSE)      -0.36      0.44    -0.94     0.65 1.02      384
                           Tail_ESS
sd(Intercept)                  1261
sd(PhonLev)                    1031
sd(IsWordFALSE)                2778
cor(Intercept,PhonLev)         2153
cor(Intercept,IsWordFALSE)     1515
cor(PhonLev,IsWordFALSE)       1057

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept             755.18     49.73   659.14   853.47 1.00     2234     1457
PhonLev                31.94      6.60    18.84    44.46 1.00     2510     2759
IsWordFALSE           208.11     67.09    75.56   339.99 1.00     2523     2380
PhonLev:IsWordFALSE   -11.10      9.04   -28.42     6.65 1.00     2493     2788

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   320.27      4.25   312.03   328.73 1.00     5408     2783

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
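
If you prefer these diagnostics as numbers rather than reading them off the printed summary, one option (a sketch using the posterior package, which brms builds on) is summarise_draws(), which reports Rhat, Bulk_ESS and Tail_ESS for every parameter:

library(posterior)

# Per-parameter summaries, including rhat, ess_bulk and ess_tail
summarise_draws(as_draws_df(brm_1))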

A Bayesian linear model: plot predictions

conditional_effects(brm_1, effects = "PhonLev:IsWord")

Let’s start small

Try it yourself!

brm_2 <- brm(
  # formula = outcome ~ predictor
  RT ~ IsWord,
  data = mald,
  # Specify the distribution family of the outcome
  family = gaussian()
)
  • The model estimates the probability distributions of the effects of each predictor.

  • To do so, a sampling algorithm is used (Markov Chain Monte Carlo, MCMC).

MCMC what?

Markov Chain Monte Carlo
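
To give a flavour of what “sampling” means, here is a toy random-walk Metropolis sampler for the mean RT. This is only an illustration of the general MCMC idea: brms/Stan use a more sophisticated sampler (NUTS), and the flat prior and fixed sigma below are simplifying assumptions.

set.seed(42)
y <- mald$RT
n_iter <- 4000
mu <- numeric(n_iter)
mu[1] <- mean(y)

for (i in 2:n_iter) {
  # Propose a small random step away from the current value
  proposal <- mu[i - 1] + rnorm(1, mean = 0, sd = 10)
  # Log posterior ratio (flat prior on mu; sigma fixed at sd(y) for simplicity)
  log_ratio <- sum(dnorm(y, proposal, sd(y), log = TRUE)) -
    sum(dnorm(y, mu[i - 1], sd(y), log = TRUE))
  # Accept the proposal with probability min(1, exp(log_ratio))
  mu[i] <- if (log(runif(1)) < log_ratio) proposal else mu[i - 1]
}

hist(mu, main = "Toy MCMC samples of the mean RT")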

Technicalities

brm_2 <- brm(
  RT ~ IsWord,
  data = mald,
  family = gaussian(),
  
  # TECHNICAL STUFF
  # Save model output to file
  file = "./data/cache/brm_2.rds",
  # Number of MCMC chains
  chains = 4, 
  # Number of iterations per chain
  iter = 2000,
  # Number of cores to use (one per chain)
  cores = 4
)

You can find out how many cores your laptop has with parallel::detectCores().

The model summary

brm_2
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ IsWord 
   Data: mald (Number of observations: 3000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     981.35      8.73   964.38   998.37 1.00     3970     3032
IsWordFALSE   132.87     12.63   107.91   156.87 1.00     3936     2937

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   341.21      4.48   332.33   349.82 1.00     4223     3006

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Interpreting the summary

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     981.35      8.73   964.38   998.37 1.00     3970     3032
IsWordFALSE   132.87     12.63   107.91   156.87 1.00     3936     2937

Intercept: There is a 95% probability that (based on model and data) the mean RT when the word is real is between 964 and 998 ms.

  • The probability distribution of the intercept has mean = 981 ms and SD = 9 (rounded).

IsWordFALSE: There is a 95% probability that (based on model and data) the difference in RT between nonce words and real words is between +108 and +157 ms.

  • The probability distribution of IsWordFALSE has mean = 133 ms and SD = 13 (rounded).
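
If you want these numbers as a table rather than the full printed summary, one convenient option is fixef(), which returns the population-level estimates with their 95% credible intervals:

# Population-level effects: Estimate, Est.Error, Q2.5, Q97.5
fixef(brm_2)

# Rounded, for reporting
round(fixef(brm_2))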

Posterior probabilities

  • These are posterior probability distributions.
  • They are always conditional on model and data.
  • The summary reports 95% Credible Intervals, but you can get other intervals too (there is nothing special about 95%).
summary(brm_2, prob = 0.8)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ IsWord 
   Data: mald (Number of observations: 3000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
            Estimate Est.Error l-80% CI u-80% CI Rhat Bulk_ESS Tail_ESS
Intercept     981.35      8.73   970.04   992.96 1.00     3970     3032
IsWordFALSE   132.87     12.63   116.58   148.84 1.00     3936     2937

Further Distributional Parameters:
      Estimate Est.Error l-80% CI u-80% CI Rhat Bulk_ESS Tail_ESS
sigma   341.21      4.48   335.59   346.97 1.00     4223     3006

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Quick plot

plot(brm_2, combo = c("dens", "trace"))

Extract the MCMC draws

brm_2_draws <- as_draws_df(brm_2)
brm_2_draws
# A draws_df: 1000 iterations, 4 chains, and 6 variables
   b_Intercept b_IsWordFALSE sigma Intercept lprior   lp__
1          981           144   349      1052    -13 -21763
2          989           131   343      1053    -13 -21761
3          973           138   338      1040    -13 -21761
4          983           126   337      1045    -13 -21761
5          996           114   346      1052    -13 -21762
6          966           156   339      1042    -13 -21762
7          973           140   343      1042    -13 -21761
8          996           127   344      1058    -13 -21762
9          990           123   340      1050    -13 -21761
10         970           145   341      1042    -13 -21761
# ... with 3990 more draws
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
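
Because the draws object behaves like a data frame, you can compute any summary you like directly from the draws. A minimal sketch (the 100 ms threshold is just an illustrative value):

# 95% quantile interval for the IsWord effect, computed from the draws
quantile(brm_2_draws$b_IsWordFALSE, probs = c(0.025, 0.975))

# Posterior probability that the effect is greater than 100 ms
mean(brm_2_draws$b_IsWordFALSE > 100)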

Get parameters

To list all the names of the parameters in a model, use:

variables(brm_2)
[1] "b_Intercept"   "b_IsWordFALSE" "sigma"         "Intercept"    
[5] "lprior"        "lp__"         

Posterior distributions: intercept

Now you can plot the draws using ggplot2.

brm_2_draws %>%
  ggplot(aes(b_Intercept)) +
  stat_halfeye(fill = "#214d65", alpha = 0.8) +
  scale_x_continuous() +
  labs(title = "Posterior distribution of Intercept")

Posterior distribution: IsWordFALSE

brm_2_draws %>%
  ggplot(aes(b_IsWordFALSE)) +
  stat_halfeye(fill = "#624B27", alpha = 0.8) +
  scale_x_continuous() +
  labs(title = "Posterior distribution of IsWord: FALSE - TRUE")

Conditional posterior probabilities

conditional_effects(brm_2, "IsWord")

Summary

  • Bayesian statistics is a framework for doing inference (i.e. learning something about a population from a sample).

  • The Bayesian framework is based on estimating uncertainty with probability distributions on parameters.

  • The syntax you know from [g]lm[er]() works with brm() from the brms package.

  • Results allow you to make statements like “There is a 95% probability that the difference between A and B is between q and w”.
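
One quick way to quantify such statements in brms is the hypothesis() function; the 100 ms threshold below is just an illustrative value:

# Posterior probability that the IsWord effect is greater than 100 ms
hypothesis(brm_2, "IsWordFALSE > 100")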