class: center, middle, inverse, title-slide .title[ # How to Bayes: Priors ] .author[ ### Stefano Coretta ] .institute[ ### UoE LEL ] .date[ ### 2022-05-23 ] --- # Bayesian belief update .center[  ] ??? How do we ensure that prior knowledge is not lost? You've seen in the previous sessions that Bayesian analysis is about estimating the posterior probability distribution of a variable of interest. Roughly speaking, the posterior probability distribution is the combination of the prior belief and the evidence derived from the data. Prior and evidence are, of course, probability distributions. --- # Example from last time [Massive Auditory Lexical Decision](https://aphl.artsrn.ualberta.ca/?page_id=827) (Tucker et al. 2019): - **MALD data set**: 521 subjects, RTs and accuracy. - Subset of MALD: 30 subjects, 100 observations each. -- ```r mald ``` ``` ## # A tibble: 3,000 × 7 ## Subject Item IsWord PhonLev RT ACC RT_log ## <chr> <chr> <fct> <dbl> <int> <lgl> <dbl> ## 1 15345 nihnaxr FALSE 5.84 945 TRUE 6.85 ## 2 15345 skaep FALSE 6.03 1046 FALSE 6.95 ## 3 15345 grandparents TRUE 10.3 797 TRUE 6.68 ## 4 15345 sehs FALSE 5.88 2134 TRUE 7.67 ## 5 15345 cousin TRUE 5.78 597 TRUE 6.39 ## 6 15345 blowup TRUE 6.03 716 TRUE 6.57 ## 7 15345 hhehrnzmaxn FALSE 7.30 1985 TRUE 7.59 ## 8 15345 mantic TRUE 6.21 1591 TRUE 7.37 ## 9 15345 notable TRUE 6.82 620 TRUE 6.43 ## 10 15345 prowthihviht FALSE 7.68 1205 TRUE 7.09 ## # … with 2,990 more rows ``` --- # Let's start small .f2[ Which probability distribution can we expect RTs to have in a standard lexical decision task? ] --- # Distribution of RTs .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` ] ??? This formula tells us that `RT`s are distributed according to (`~`) a Gaussian distribution with mean `\(\mu\)` and standard deviation `\(\sigma\)`. -- <br> <br> <br> Which values do `\(\mu\)` and `\(\sigma\)` have? -- Pick a `\(\mu\)` and `\(\sigma\)`. (OK, but how?) 
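One way to get a feel for candidate values is to simulate RTs from them and inspect the implied range. A minimal sketch in base R (the values `mean = 1000` and `sd = 300` are illustrative guesses, not the answer):

```r
# Simulate RTs (in ms) from a guessed Normal(mu, sigma).
# mu = 1000 and sigma = 300 are illustrative guesses, not "the" answer.
set.seed(1)
rt_sim <- rnorm(1e5, mean = 1000, sd = 300)
round(quantile(rt_sim, c(0.025, 0.5, 0.975)))
# roughly 400, 1000, 1600: does this range sound plausible for RTs?
```

If the simulated range looks implausible, adjust the guess and simulate again.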
--- layout: true # The empirical rule --- <img src="index_files/figure-html/empirical-rule-1-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/empirical-rule-2-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/empirical-rule-3-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/empirical-rule-4-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/empirical-rule-5-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> ??? As a general rule, `\(\pm2\sigma\)` covers 95% of the Gaussian distribution, which means we are 95% confident that the value lies within that range. --- layout: false # Distribution of RTs .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` ] <br> <br> <br> Pick a `\(\mu\)` and `\(\sigma\)`. Report them here: <https://forms.gle/cqBNtDvZ6NdnpPo79> --- class: center middle inverse .f1[Now let's see what we got!] --- # Distribution of RTs .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu = ...?$$` ] ??? We do have an idea of what the mean RT could be like but we are not certain. When you are not certain, you make a list of values and their probability, i.e. a probability distribution! --- # Prior belief as probability distributions .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(\mu_1, \sigma_1)$$` ] ??? Usually, we assume the mean `\(\mu\)` to be a value taken from another Gaussian distribution (with its own mean and SD). This Gaussian distribution is the **prior probability distribution** (or simply prior) of the mean. --- # Prior belief as probability distributions .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(\mu_1 = 0, \sigma_1)$$` ] ??? We will talk about different types of priors later. 
For now it's sufficient to remember that a conservative approach (which is what we want to do) is to set `\(\mu_1\)` to 0. -- <br> <br> **What about `\(\sigma_1\)`?** --- # Prior belief as probability distributions .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(\mu_1 = 0, \sigma_1 = 1500)$$` ] ??? Let's be generous and assume that the mean RT will definitely not be greater than 3 seconds. (This might seem like a very high number, but we will see how it can help estimation.) `\(3000/2 = 1500\)` ms (remember, two times the SD), so we can set `\(\sigma_1 = 1500\)`. --- # Seeing is believing <img src="index_files/figure-html/mu1-prior-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> ??? Visualising priors is important, because it's easier to grasp their meaning when you can actually see the shape of the distribution. Are you wondering about the negative values? (RT cannot be negative!) I will say more about this and how to "fix" it later. WARNING: The distribution that you see here **is not** the distribution of RTs! It's the prior distribution of the mean of that distribution. --- # Prior belief as probability distributions .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(0, 1500)$$` ] -- .f6[ We'll talk about `\(\sigma\)` in a second... ] --- # Distribution of RTs .f2[ .center[ `RT ~ 1, family = gaussian` ] `$$\mu \sim Normal(0, 1500)$$` ] -- <br> <br> <br> How do we specify priors? ??? Before we move on to specifying priors, it is helpful to see how priors are stored in R for use with brms. --- # Get the default priors ```r get_prior( RT ~ 1, data = mald, family = gaussian ) ``` ``` ## prior class coef group resp dpar nlpar lb ub source ## student_t(3, 952, 220.9) Intercept default ## student_t(3, 0, 220.9) sigma 0 default ``` ??? You can get the default priors for the specified model with `get_prior()`. The important bits of the output are the `prior` and `class` columns. 
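(As an aside, the ±2σ reasoning behind `\(\sigma_1 = 1500\)` can be checked with base R's `pnorm()` and `qnorm()`; this is just a sanity-check sketch, not part of the model code.)

```r
# How much mass does Normal(0, 1500) place within +/- 3000 (i.e. +/- 2 SD)?
pnorm(3000, mean = 0, sd = 1500) - pnorm(-3000, mean = 0, sd = 1500)  # ~0.95
# The exact central 95% interval is a touch narrower than +/- 3000:
qnorm(c(0.025, 0.975), mean = 0, sd = 1500)  # ~ -2940, 2940
```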
We will see their use in a second. --- # Set your priors ```r my_priors <- c( prior(...), # Intercept (i.e. mu) prior(...) # sigma ) ``` --- # Set your priors ```r my_priors <- c( prior(normal(0, 1500), class = Intercept), # Intercept (i.e. mu) prior(...) # sigma ) ``` --- # What about `\(\sigma\)`? <br> <br> <br> .pull-left[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(\mu_1 = 0, \sigma_1 = 1500)$$` `$$\sigma = ...?$$` ] .pull-right[ .center[ `RT ~ 1, family = gaussian` `prior(normal(0, 1500), class = Intercept)` ...? ] ] ??? Now, what about `\(\sigma\)`? (Be careful not to mix up `\(\sigma\)` and `\(\sigma_1\)`! This is `\(\sigma\)` from the first line, not `\(\sigma_1\)` from the second line). --- # Prior for `\(\sigma\)`: Half Cauchy <img src="index_files/figure-html/cauchy-1.png" height="500px" style="display: block; margin: auto;" /> ??? ```r HDInterval::inverseCDF( c(0.5, 0.75, 0.95), phcauchy, sigma = 50 ) ``` ``` ## [1] 50.0000 120.7107 635.3102 ``` ```r HDInterval::inverseCDF( c(0.5, 0.75, 0.95), phcauchy, sigma = 100 ) ``` ``` ## [1] 100.0000 241.4214 1270.6205 ``` --- # Distribution of RTs <br> <br> <br> .pull-left[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(\mu_1 = 0, \sigma_1 = 1500)$$` `$$\sigma \sim HalfCauchy(\mu_2 = 0, \sigma_2 = 100)$$` ] .pull-right[ .center[ `RT ~ 1, family = gaussian` `prior(normal(0, 1500), class = Intercept)` `prior(cauchy(0, 100), class = sigma)` ] ] --- layout: true # Prior Predictive Checks --- ```r my_priors <- c( prior(normal(0, 1500), class = Intercept), # Intercept (i.e. 
mu) prior(cauchy(0, 100), class = sigma) # sigma ) rt_1_pp <- brm( RT ~ 1, data = mald, family = gaussian, prior = my_priors, sample_prior = "only", cores = my_cores, file = "data/cache/rt_1_pp", backend = "cmdstanr" ) ``` --- <img src="index_files/figure-html/rt-1-pp-plot-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> --- layout: false # Let's run the model ```r my_priors <- c( prior(normal(0, 1500), class = Intercept), # Intercept (i.e. mu) prior(cauchy(0, 100), class = sigma) # sigma ) rt_1 <- brm( RT ~ 1, data = mald, family = gaussian, prior = my_priors, cores = my_cores, file = "data/cache/rt_1", backend = "cmdstanr" ) ``` --- # Distribution of RTs ```r rt_1 ``` ``` ## Family: gaussian ## Links: mu = identity; sigma = identity ## Formula: RT ~ 1 ## Data: mald (Number of observations: 3000) ## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; ## total post-warmup draws = 4000 ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## Intercept 1046.59 6.51 1033.74 1059.55 1.00 3274 2453 ## ## Family Specific Parameters: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## sigma 347.62 4.44 339.20 356.35 1.00 3833 3001 ## ## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1). ``` --- # Distribution of RTs <img src="index_files/figure-html/rt-1-ppc-1.png" height="500px" style="display: block; margin: auto;" /> --- class: inverse center bottom background-image: url("img/chris-robert-unsplash.jpg") # BREAK --- # PP check didn't look good... <img src="index_files/figure-html/rt-1-ppc-2-1.png" height="500px" style="display: block; margin: auto;" /> ??? Why is that? Because our prior beliefs were a bit off... --- # RT values are not Gaussian <br> Reaction times are not Gaussian (i.e. 
they are **not generated by a Gaussian process**). -- There are several proposed distribution families for RTs. -- The most notable: - **Log-normal** distribution. - **ExGaussian** or Exponential-Gaussian distribution. ??? Note that fitting a Gaussian distribution to log-transformed values is equivalent to fitting a log-normal distribution to non-transformed values. --- class: center middle inverse .f1[How do you specify priors on the log scale?] --- # Accuracy and phonemic distance ```r mald ``` ``` ## # A tibble: 3,000 × 7 ## Subject Item IsWord PhonLev RT ACC RT_log ## <chr> <chr> <fct> <dbl> <int> <lgl> <dbl> ## 1 15345 nihnaxr FALSE 5.84 945 TRUE 6.85 ## 2 15345 skaep FALSE 6.03 1046 FALSE 6.95 ## 3 15345 grandparents TRUE 10.3 797 TRUE 6.68 ## 4 15345 sehs FALSE 5.88 2134 TRUE 7.67 ## 5 15345 cousin TRUE 5.78 597 TRUE 6.39 ## 6 15345 blowup TRUE 6.03 716 TRUE 6.57 ## 7 15345 hhehrnzmaxn FALSE 7.30 1985 TRUE 7.59 ## 8 15345 mantic TRUE 6.21 1591 TRUE 7.37 ## 9 15345 notable TRUE 6.82 620 TRUE 6.43 ## 10 15345 prowthihviht FALSE 7.68 1205 TRUE 7.09 ## # … with 2,990 more rows ``` --- layout: true # Accuracy and phonemic distance --- <img src="index_files/figure-html/acc-pd-1.png" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/acc-pd-2-1.png" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/acc-pd-3-1.png" height="500px" style="display: block; margin: auto;" /> --- layout: false # Distribution of accuracy data by phonemic distance .f2[ `$$\text{ACC}_i \sim Bernoulli(p)$$` ] -- .f2[ `$$logit(p) = \beta_0 + \beta_1\text{PL}_i$$` ] -- .f2[ `$$\beta_0 \sim Normal(\mu_0, \sigma_0)$$` ] -- .f2[ `$$\beta_1 \sim Normal(\mu_1, \sigma_1)$$` ] -- <br> But we are on the log-odds scale now, because of the `logit(p)` part! -- So priors have to be specified on the log-odds scale. 
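In base R, `qlogis()` (logit) and `plogis()` (inverse logit) convert between probabilities and log-odds, which helps when reasoning about priors on this scale. A quick sketch (not part of the model code):

```r
# probability -> log-odds (logit)
qlogis(0.5)           # 0: a 50% probability is 0 log-odds
qlogis(c(0.1, 0.9))   # equal magnitude, opposite sign
# log-odds -> probability (inverse logit)
plogis(0)             # 0.5
plogis(c(-3, 3))      # roughly 0.05 and 0.95
```

This is also why `Normal(0, 1.5)` is a sensible weakly informative intercept prior: ±2 × 1.5 = ±3 log-odds spans roughly 5%–95% probability.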
--- # Log-odds and probabilities <img src="index_files/figure-html/p-log-odds-1.png" height="500px" style="display: block; margin: auto;" /> --- # Distribution of accuracy data by phonemic distance .f2[ `$$\text{ACC}_i \sim Bernoulli(p)$$` `$$logit(p) = \beta_0 + \beta_1\text{PL}_i$$` `$$\beta_0 \sim Normal(0, 1.5)$$` `$$\beta_1 \sim Normal(0, 1)$$` ] ??? Remember, `Normal(0, 1.5)` corresponds to a 95% CrI of [-3, +3], which in probability terms is approximately [5%, 95%]. `Normal(0, 1)` corresponds to a 95% CrI of [-2, +2], which is approximately [0.14, 7.4] in odds. --- # Accuracy and phonemic distance ```r priors <- c( prior(normal(0, 1.5), class = Intercept), # beta_0 prior(normal(0, 1), class = b, coef = PhonLev) # beta_1 ) acc_1 <- brm( ACC ~ PhonLev, data = mald, family = bernoulli, prior = priors, cores = my_cores, file = "data/cache/acc_1", backend = "cmdstanr" ) ``` --- # Accuracy and phonemic distance ```r conditional_effects(acc_1, "PhonLev") ``` <img src="index_files/figure-html/acc-1-cond-1.png" height="500px" style="display: block; margin: auto;" /> --- # Accuracy and phonemic distance by word type ```r priors <- c( prior(normal(0, 1.5), class = Intercept), # beta_0 prior(normal(0, 1), class = b, coef = PhonLev), # beta_1 prior(normal(0, 1), class = b, coef = IsWordFALSE), # beta_2 prior(normal(0, 1), class = b, coef = `PhonLev:IsWordFALSE`) # beta_3 ) acc_2 <- brm( ACC ~ PhonLev * IsWord, data = mald, family = bernoulli, prior = priors, cores = my_cores, file = "data/cache/acc_2", backend = "cmdstanr" ) ``` --- # Accuracy and phonemic distance by word type ```r conditional_effects(acc_2, "PhonLev:IsWord") ``` <img src="index_files/figure-html/acc-2-cond-1.png" height="500px" style="display: block; margin: auto;" /> --- class: inverse center bottom background-image: url("img/chris-robert-unsplash.jpg") # Questions?