25  Wrangling MCMC draws

25.1 MCMC what?

Bayesian regression models fitted with brms/Stan use the Markov Chain Monte Carlo (MCMC) sampling algorithm to estimate the probability distributions of the model’s coefficients. You first encountered MCMC in Chapter 22. The MCMC algorithm does two things: first, it finds a high-probability area, the area most compatible with the prior probabilities and the likelihood of the data; then, it samples values (called draws) from this high-probability area to reconstruct the posterior distribution. Since the high-probability area is assumed to be representative of the posterior distribution (that’s why we sample from it), the MCMC draws are also called posterior draws. Elizabeth Pankratz has written a nice short introduction to MCMC algorithms here: Finding posteriors through sampling. I strongly recommend that you read it before proceeding (if you want to further your understanding of MCMC, the resources linked on that page are an excellent starting point). Note that to be a proficient user of brms and Bayesian regression models, you don’t need to fully understand the mathematics behind MCMC algorithms, as long as you understand them conceptually.

When you run a model with brms, the draws (i.e. the sampled values) are stored in the model object. All operations on a model, like obtaining the summary(), are actually operations on those draws. We normally fit regression models with four MCMC chains. By default, the sampling algorithm within each chain runs for 2000 iterations. The first half (1000 iterations) is used to “warm up” the algorithm (i.e. to find the high-probability area), while the second half (1000 iterations) samples from the high-probability area to reconstruct the posterior distribution. With four chains run for 2000 iterations each, of which 1000 are warm-up, we end up with 4000 iterations we can use to learn about the posterior.

The rest of the chapter will teach you how to extract and manipulate the model’s draws. We will first revisit the model fitted in Chapter 27.

25.2 Extract MCMC posterior draws

There are different ways to extract the MCMC posterior draws from a fitted model. In this book, we will use the as_draws_df() function from the posterior package. The function extracts the draws from a Bayesian regression model and outputs them as a data frame. We will take the rt_bm_1 model from Chapter 27 and extract its draws. First, let’s read in the tucker2019/mald_1_1.rds data again.

library(tidyverse)

mald <- readRDS("data/tucker2019/mald_1_1.rds")

Now we reload the Bayesian regression model from Chapter 27, using the same code we used in that chapter. If you run the code now, the file with the cached model will be loaded into R, so the model will not be fitted from scratch. This is both convenient and it ensures the code is reproducible.

library(brms)

rt_bm_1 <- brm(
  RT ~ IsWord,
  family = gaussian,
  data = mald,
  seed = 6725,
  file = "cache/ch-regression-cat-rt_bm_1"
)
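
For reference, this call relies on the brms defaults for the MCMC settings described in the previous section. Spelled out explicitly, it would look like the sketch below (chains, iter and warmup are standard brm() arguments, and the values shown are the brms defaults):

rt_bm_1 <- brm(
  RT ~ IsWord,
  family = gaussian,
  data = mald,
  chains = 4,    # number of MCMC chains (brms default)
  iter = 2000,   # total iterations per chain (brms default)
  warmup = 1000, # warm-up iterations per chain (brms default)
  # 4 chains times (2000 - 1000) post-warmup iterations = 4000 draws
  seed = 6725,
  file = "cache/ch-regression-cat-rt_bm_1"
)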

Let’s review the model summary.

summary(rt_bm_1)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ IsWord 
   Data: mald (Number of observations: 5000) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     953.14      6.12   941.17   965.17 1.00     4590     2897
IsWordFALSE   116.34      8.80    98.74   134.14 1.00     4815     3235

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   312.47      3.11   306.44   318.60 1.00     4720     3456

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The Regression Coefficients table includes Intercept and IsWordFALSE, which is what we expect given the model formula and data. We are good to go! Now we can extract the MCMC draws from the model using the as_draws_df() function. This function from the posterior package returns a (special kind of) tibble with the values obtained at each draw of the MCMC algorithm. Since we fitted the model with 4 MCMC chains and 1000 sampling draws per chain, there are a total of 4000 drawn values (i.e. 4000 rows in the data frame).

rt_bm_1_draws <- as_draws_df(rt_bm_1)
rt_bm_1_draws

Don’t worry about the Intercept, lprior and lp__ columns. Open the data frame in the RStudio viewer. You will see three extra columns: .chain, .iteration and .draw. They indicate (a quick check follows the list):

  • .chain: The MCMC chain number (1 to 4).
  • .iteration: The iteration number within chain (1 to 1000).
  • .draw: The draw number across all chains (1 to 4000).
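
We can verify this structure directly by counting the rows per chain. This is a sketch using dplyr’s count(); as with other tidyverse operations on draws objects, you might see a warning about the 'draws_df' class being dropped.

rt_bm_1_draws |> count(.chain)

Each chain should contribute 1000 post-warmup rows, for a total of 4000 across the four chains.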

Make sure that you understand these columns in light of the MCMC algorithm. The remaining columns contain the values obtained at each draw for three parameters of the model: b_Intercept, b_IsWordFALSE and sigma. To remind yourself of what these mean, let’s have a look at the mathematical formula of the model.

\[ \begin{align} RT_i & \sim Gaussian(\mu_i, \sigma) \\ \mu_i & = \beta_0 + \beta_1 \cdot w_i \\ \end{align} \]

So:

  • b_Intercept is \(\beta_0\). This is the mean RT when the word is a real word.
  • b_IsWordFALSE is \(\beta_1\). This is the difference in RT between nonce words and real words.
  • sigma is \(\sigma\). This is the overall standard deviation of the RTs.
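
As an aside, if you ever want to list all the variables stored in a draws object, the posterior package provides the variables() function (a minimal sketch):

# lists "b_Intercept", "b_IsWordFALSE" and "sigma", plus the
# auxiliary "Intercept", "lprior" and "lp__" variables
posterior::variables(rt_bm_1_draws)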

Any inferences made on the basis of the model are inferences derived from the draws. One could say that the model’s “results” are, to put it simply, these draws, and that the draws can be used to make inferences about the population one is investigating (if one is interested in inference in the first place).

25.3 Summary measures of the posterior draws

The Regression Coefficients table from the summary() of the model reports summary measures calculated from the drawn values of b_Intercept and b_IsWordFALSE. These summary measures are the mean (Estimate), the standard deviation (Est.Error) and the lower and upper limits of the 95% Credible Interval (CrI). Since we now have the draws, we can of course obtain those same measures ourselves from the data frame with the draws! We will use the quantile2() function from the posterior package to obtain the limits of the 95% CrIs.

library(posterior)
This is posterior version 1.6.1

Attaching package: 'posterior'
The following objects are masked from 'package:stats':

    mad, sd, var
The following objects are masked from 'package:base':

    %in%, match
rt_b1_summaries <- rt_bm_1_draws |> 
  summarise(
    intercept_mean = mean(b_Intercept), intercept_sd = sd(b_Intercept),
    intercept_lo_95 = quantile2(b_Intercept, 0.025), intercept_up_95 = quantile2(b_Intercept, 0.975),
    IsWordFALSE_mean = mean(b_IsWordFALSE), IsWordFALSE_sd = sd(b_IsWordFALSE),
    IsWordFALSE_lo_95 = quantile2(b_IsWordFALSE, 0.025), IsWordFALSE_up_95 = quantile2(b_IsWordFALSE, 0.975),
  ) |> 
  # round all columns to two decimal places; across() applies a
  # function to multiple columns at once
  mutate(
    across(everything(), ~round(., 2))
  )
rt_b1_summaries

Compare the values obtained now with the values in the model summary above. They are the same, because the summary measures in the model summary are simply summary measures of the draws. Hopefully it is now clear where the values in the Regression Coefficients table come from and how they are related to the MCMC draws.
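
Incidentally, the posterior package can compute summary measures for all parameters in one call with summarise_draws(). The following is a sketch; summarise_draws() accepts functions and purrr-style lambdas as summary measures:

rt_bm_1_draws |> 
  summarise_draws(mean, sd, ~quantile2(.x, probs = c(0.025, 0.975)))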

25.4 Plotting posterior draws

Plotting posterior draws is as straightforward as plotting any data. You already have all of the tools to understand plotting draws with ggplot2. To plot the reconstructed posterior probability distribution of any parameter, we plot the probability density (with geom_density()) of the draws of that parameter. Let’s plot b_IsWordFALSE. This will be the posterior probability density of the difference in RTs between nonce words (IsWord is FALSE) and real words (IsWord is TRUE).

rt_bm_1_draws |>
  ggplot(aes(b_IsWordFALSE)) +
  geom_density() +
  geom_rug(alpha = 0.2)

25.5 Calculate posterior predictions from the draws

In Chapter 28, you learned how to use indexing of categorical predictors to estimate the mean of the outcome variable for each level of the categorical predictor. You might be wondering: with contrasts you get the mean of the reference level and the difference between the second level and the reference level; with indexing you get the mean at each level; yet all of these estimands (the means of all levels and the differences between levels) are important for inference. So, should you fit the model twice to obtain all the estimands? Of course not: independently of how you code categorical predictors, the model contains all the information needed to calculate all of the estimands of interest. This information is in the MCMC draws. In this section, you will learn how to obtain the estimated mean RT when the word is a nonce word from the model that used contrasts.

Let’s revisit the formula of the mean \(\mu\) from the model formula above: \(\mu_i = \beta_0 + \beta_1 \cdot w_i\). The mean depends on the value of \(w\) (that’s what the subscript \(i\) is for). We can substitute \(w_i\) with 0 for IsWord = TRUE and with 1 for IsWord = FALSE. So we get \(\mu_T = \beta_0\) for TRUE and \(\mu_F = \beta_0 + \beta_1\) for FALSE. Look again at rt_bm_1_draws. Recall that b_Intercept is \(\beta_0\) and b_IsWordFALSE is \(\beta_1\). Can you figure out what you need to do to get the posterior probability distribution of \(\mu_F\) (i.e. the mean RT when IsWord is FALSE)? You sum the columns b_Intercept and b_IsWordFALSE! It goes without saying that the posterior probability distribution of the mean RT when IsWord is TRUE is simply the probability density of b_Intercept.

rt_bm_1_draws <- rt_bm_1_draws |> 
  mutate(
    # predicted RT for real words
    real = b_Intercept,
    # predicted RT for nonce words
    nonce = b_Intercept + b_IsWordFALSE
  )

rt_bm_1_draws |> select(real, nonce)
Warning: Dropping 'draws_df' class as required metadata was removed.

The sum operation for nonce is applied row-wise, to each draw (remember, each row in the draws tibble is the draw from one iteration of the MCMC chains). So, for nonce, the value of b_Intercept at draw 1 is added to the value of b_IsWordFALSE at draw 1, and so on. You end up with a vector of sums that has the same length as the initial draws (here, 4000, i.e. 1000 per chain). Then you can summarise and plot this new vector as you did with the b_ coefficients earlier.

To make further processing of the draws easier, we can reshape the tibble so that instead of having two columns real and nonce with the drawn values in them, we end up with a tibble with one column IsWord (where we store the value real or nonce) and a column named value with the drawn value. Reshaping tibbles is called pivoting in tidyverse parlance. We won’t go into the details of all pivoting operations, so I strongly recommend you look through the vignette on pivoting (a vignette is a short tutorial that some R packages provide for users to learn basic operations with the functions in the package). The operation we need to get the shape we want is pivoting from a “wider” format to a “longer” format: this is achieved with pivot_longer() from tidyr. Make sure you read the section on pivot_longer() in the vignette to understand what the following code does.

rt_bm_1_draws_long <- rt_bm_1_draws |> 
  select(.chain, .iteration, .draw, real, nonce) |> 
  pivot_longer(real:nonce, names_to = "IsWord", values_to = "value")
Warning: Dropping 'draws_df' class as required metadata was removed.
rt_bm_1_draws_long

Check the resulting rt_bm_1_draws_long tibble. The first two rows are draw 1 of chain 1: one row is for the case when the word is a real word, and the other row is for the case when the word is a nonce word. There are a total of 8000 rows: 4000 draws (1000 draws times 4 chains) times 2 (the two levels of IsWord). Now let’s make a density plot of the predicted mean RT for real and nonce words!
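
Here is one way to produce the plot in Figure 25.1 (a sketch: the fill mapping and the transparency value are illustrative choices):

rt_bm_1_draws_long |> 
  ggplot(aes(value, fill = IsWord)) +
  geom_density(alpha = 0.5) +
  labs(x = "Predicted RT (ms)", fill = "Word type")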

Figure 25.1: Density plot of predicted mean RT for real and nonce words.

We can use the ggdist package to plot something a bit fancier, for example a “half-eye” geometry, which can be obtained with stat_halfeye(). Check the ?stat_halfeye documentation to learn more about it.

library(ggdist)

Attaching package: 'ggdist'
The following objects are masked from 'package:brms':

    dstudent_t, pstudent_t, qstudent_t, rstudent_t
rt_bm_1_draws_long |> 
  ggplot(aes(value, IsWord)) +
  stat_halfeye() +
  labs(x = "Predicted RT (ms)", y = "Word type")
Figure 25.2: Half-eye plot of predicted mean RT for real and nonce words.

Another useful statistic, also from ggdist, is stat_interval().

rt_bm_1_draws_long |> 
  ggplot(aes(IsWord, value)) +
  stat_interval() +
  labs(
    x = "Word type", y = "Predicted RT (ms)"
  )
Figure 25.3: Credible Intervals of predicted mean RT for real and nonce words.
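
stat_interval() plots a series of nested credible intervals. To match the 60%, 80% and 90% CrIs reported in Table 25.1 below, you can set its .width argument (a sketch; .width is part of the ggdist API):

rt_bm_1_draws_long |> 
  ggplot(aes(IsWord, value)) +
  stat_interval(.width = c(0.6, 0.8, 0.9)) +
  labs(x = "Word type", y = "Predicted RT (ms)")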

When reporting the results, it is recommended to report the CrIs of the predicted outcome for each level of the categorical predictor. It is useful to include tables like Table 25.1. To make the generation of the table more straightforward, we can use a couple of tricks when creating rt_bm_1_cris below. The glue() function from the glue package (part of the tidyverse ecosystem) allows you to build strings from the output of R code: code included between curly braces {} is evaluated and its result is inserted into the string.

library(glue)
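
For example, in this toy call the R code between the braces is evaluated and interpolated into the string:

glue("The mean of 1:10 is {mean(1:10)}")
The mean of 1:10 is 5.5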

rt_bm_1_cris <- rt_bm_1_draws_long |> 
  mutate(IsWord = factor(IsWord, levels = c("real", "nonce"))) |> 
  group_by(IsWord) |> 
  summarise(
    mean = mean(value), sd = sd(value),
    ci_90 = glue("[{round(quantile2(value, 0.05))}, {round(quantile2(value, 0.95))}]"),
    ci_80 = glue("[{round(quantile2(value, 0.1))}, {round(quantile2(value, 0.9))}]"),
    ci_60 = glue("[{round(quantile2(value, 0.2))}, {round(quantile2(value, 0.8))}]"),
  )
knitr::kable(rt_bm_1_cris, col.names = c("Type", "Mean RT", "SD", "90% CrI", "80% CrI", "60% CrI"), digits = 0)
Table 25.1: Summary measures of predicted RTs by word type (in milliseconds).
Type   Mean RT  SD  90% CrI       80% CrI       60% CrI
real       953   6  [943, 963]    [945, 961]    [948, 958]
nonce     1069   6  [1059, 1080]  [1062, 1077]  [1064, 1075]

In the main text, you could report the results like so:

According to the model, the predicted RT when the word is a real word is between 943 and 963 ms at 90% probability (mean = 953 ms, SD = 6). When the word is a nonce word, the predicted RT is between 1059 and 1080 ms (mean = 1069 ms, SD = 6).