class: center, middle, inverse, title-slide .title[ # How to Bayes: Priors ] .author[ ### Stefano Coretta ] .institute[ ### UoE LEL ] .date[ ### 2022-05-23 ] --- # Bayesian belief update .center[  ] ??? How do we ensure that prior knowledge is not lost? You've seen in the previous sessions that Bayesian analysis is about estimating the posterior probability distribution of a variable of interest. Roughly speaking, the posterior probability distribution is the combination of the prior belief and the evidence derived from the data. Prior and evidence are, of course, probability distributions. --- # Example from last time [Massive Auditory Lexical Decision](https://aphl.artsrn.ualberta.ca/?page_id=827) (Tucker et al. 2019): - **MALD data set**: 521 subjects, RTs and accuracy. - Subset of MALD: 30 subjects, 100 observations each. -- ```r mald ``` ``` ## # A tibble: 3,000 × 7 ## Subject Item IsWord PhonLev RT ACC RT_log ## <chr> <chr> <fct> <dbl> <int> <lgl> <dbl> ## 1 15345 nihnaxr FALSE 5.84 945 TRUE 6.85 ## 2 15345 skaep FALSE 6.03 1046 FALSE 6.95 ## 3 15345 grandparents TRUE 10.3 797 TRUE 6.68 ## 4 15345 sehs FALSE 5.88 2134 TRUE 7.67 ## 5 15345 cousin TRUE 5.78 597 TRUE 6.39 ## 6 15345 blowup TRUE 6.03 716 TRUE 6.57 ## 7 15345 hhehrnzmaxn FALSE 7.30 1985 TRUE 7.59 ## 8 15345 mantic TRUE 6.21 1591 TRUE 7.37 ## 9 15345 notable TRUE 6.82 620 TRUE 6.43 ## 10 15345 prowthihviht FALSE 7.68 1205 TRUE 7.09 ## # … with 2,990 more rows ``` --- # Let's start small .f2[ Which probability distribution can we expect RTs to have in a standard lexical decision task? ] --- # Distribution of RTs .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` ] ??? This formula tells us that `RT`s are distributed according to (`~`) a Gaussian distribution with mean `\(\mu\)` and standard deviation `\(\sigma\)`. -- <br> <br> <br> Which values do `\(\mu\)` and `\(\sigma\)` have? -- Pick a `\(\mu\)` and `\(\sigma\)`. (OK, but how?) 
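One way to get a feel for candidate values is to simulate RTs from them and inspect the implied range. A minimal sketch in base R (the values `mean = 1000` and `sd = 300` are illustrative guesses, not the answer):

```r
# Simulate RTs (in ms) from a guessed Normal(mu, sigma).
# mu = 1000 and sigma = 300 are illustrative guesses, not "the" answer.
set.seed(1)
rt_sim <- rnorm(1e5, mean = 1000, sd = 300)
round(quantile(rt_sim, c(0.025, 0.5, 0.975)))
# roughly 400, 1000, 1600: does this range sound plausible for RTs?
```

If the simulated range looks implausible, adjust the guess and simulate again.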
--- layout: true # The empirical rule --- <img src="index_files/figure-html/empirical-rule-1-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/empirical-rule-2-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/empirical-rule-3-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/empirical-rule-4-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/empirical-rule-5-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> ??? As a general rule, `\(\pm2\sigma\)` covers 95% of the Gaussian distribution, which means we are 95% confident that the value lies within that range. --- layout: false # Distribution of RTs .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` ] <br> <br> <br> Pick a `\(\mu\)` and `\(\sigma\)`. Report them here: <https://forms.gle/cqBNtDvZ6NdnpPo79> --- class: center middle inverse .f1[Now let's see what we got!] --- # Distribution of RTs .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu = ...?$$` ] ??? We do have an idea of what the mean RT could be like but we are not certain. When you are not certain, you make a list of values and their probability, i.e. a probability distribution! --- # Prior belief as probability distributions .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(\mu_1, \sigma_1)$$` ] ??? Usually, we assume the mean `\(\mu\)` to be a value taken from another Gaussian distribution (with its own mean and SD). This Gaussian distribution is the **prior probability distribution** (or simply prior) of the mean. --- # Prior belief as probability distributions .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(\mu_1 = 0, \sigma_1)$$` ] ??? We will talk about different types of priors later. 
For now it's sufficient to remember that a conservative approach (which is what we want to do) is to set `\(\mu_1\)` to 0. -- <br> <br> **What about `\(\sigma_1\)`?** --- # Prior belief as probability distributions .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(\mu_1 = 0, \sigma_1 = 1500)$$` ] ??? Let's be generous and assume that the mean RT will definitely not be greater than 3 seconds. (This might seem like a very high number, but we will see how it can help estimation.) `\(3000/2 = 1500\)` ms (remember, two times the SD), so we can set `\(\sigma_1 = 1500\)`. --- # Seeing is believing <img src="index_files/figure-html/mu1-prior-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> ??? Visualising priors is important, because it's easier to grasp their meaning when you can actually see the shape of the distribution. Are you wondering about the negative values? (RT cannot be negative!) I will say more about this and how to "fix" it later. WARNING: The distribution that you see here **is not** the distribution of RTs! It's the prior distribution of the mean of that distribution. --- # Prior belief as probability distributions .f2[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(0, 1500)$$` ] -- .f6[ We'll talk about `\(\sigma\)` in a second... ] --- # Distribution of RTs .f2[ .center[ `RT ~ 1, family = gaussian` ] `$$\mu \sim Normal(0, 1500)$$` ] -- <br> <br> <br> How do we specify priors? ??? Before we move on to specifying priors, it is helpful to see how priors are stored in R for use with brms. --- # Get the default priors ```r get_prior( RT ~ 1, data = mald, family = gaussian ) ``` ``` ## prior class coef group resp dpar nlpar lb ub source ## student_t(3, 952, 220.9) Intercept default ## student_t(3, 0, 220.9) sigma 0 default ``` ??? You can get the default priors for the specified model with `get_prior()`. The important bits of the output are the `prior` and `class` columns. 
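(As an aside, the ±2σ reasoning behind `\(\sigma_1 = 1500\)` can be checked with base R's `pnorm()` and `qnorm()`; this is just a sanity-check sketch, not part of the model code.)

```r
# How much mass does Normal(0, 1500) place within +/- 3000 (i.e. +/- 2 SD)?
pnorm(3000, mean = 0, sd = 1500) - pnorm(-3000, mean = 0, sd = 1500)  # ~0.95
# The exact central 95% interval is a touch narrower than +/- 3000:
qnorm(c(0.025, 0.975), mean = 0, sd = 1500)  # ~ -2940, 2940
```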
We will see their use in a second. --- # Set your priors ```r my_priors <- c( prior(...), # Intercept (i.e. mu) prior(...) # sigma ) ``` --- # Set your priors ```r my_priors <- c( prior(normal(0, 1500), class = Intercept), # Intercept (i.e. mu) prior(...) # sigma ) ``` --- # What about `\(\sigma\)`? <br> <br> <br> .pull-left[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(\mu_1 = 0, \sigma_1 = 1500)$$` `$$\sigma = ...?$$` ] .pull-right[ .center[ `RT ~ 1, family = gaussian` `prior(normal(0, 1500), class = Intercept)` ...? ] ] ??? Now, what about `\(\sigma\)`? (Be careful not to mix up `\(\sigma\)` and `\(\sigma_1\)`! This is `\(\sigma\)` from the first line, not `\(\sigma_1\)` from the second line). --- # Prior for `\(\sigma\)`: Half Cauchy <img src="index_files/figure-html/cauchy-1.png" height="500px" style="display: block; margin: auto;" /> ??? ```r HDInterval::inverseCDF( c(0.5, 0.75, 0.95), phcauchy, sigma = 50 ) ``` ``` ## [1] 50.0000 120.7107 635.3102 ``` ```r HDInterval::inverseCDF( c(0.5, 0.75, 0.95), phcauchy, sigma = 100 ) ``` ``` ## [1] 100.0000 241.4214 1270.6205 ``` --- # Distribution of RTs <br> <br> <br> .pull-left[ `$$\text{RT}_i \sim Normal(\mu, \sigma)$$` `$$\mu \sim Normal(\mu_1 = 0, \sigma_1 = 1500)$$` `$$\sigma \sim HalfCauchy(\mu_2 = 0, \sigma_2 = 100)$$` ] .pull-right[ .center[ `RT ~ 1, family = gaussian` `prior(normal(0, 1500), class = Intercept)` `prior(cauchy(0, 100), class = sigma)` ] ] --- layout: true # Prior Predictive Checks --- ```r my_priors <- c( prior(normal(0, 1500), class = Intercept), # Intercept (i.e. 
mu) prior(cauchy(0, 100), class = sigma) # sigma ) rt_1_pp <- brm( RT ~ 1, data = mald, family = gaussian, prior = my_priors, sample_prior = "only", cores = my_cores, file = "data/cache/rt_1_pp", backend = "cmdstanr" ) ``` --- <img src="index_files/figure-html/rt-1-pp-plot-1.png" width="800px" height="500px" style="display: block; margin: auto;" /> --- layout: false # Let's run the model ```r my_priors <- c( prior(normal(0, 1500), class = Intercept), # Intercept (i.e. mu) prior(cauchy(0, 100), class = sigma) # sigma ) rt_1 <- brm( RT ~ 1, data = mald, family = gaussian, prior = my_priors, cores = my_cores, file = "data/cache/rt_1", backend = "cmdstanr" ) ``` --- # Distribution of RTs ```r rt_1 ``` ``` ## Family: gaussian ## Links: mu = identity; sigma = identity ## Formula: RT ~ 1 ## Data: mald (Number of observations: 3000) ## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; ## total post-warmup draws = 4000 ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## Intercept 1046.59 6.51 1033.74 1059.55 1.00 3274 2453 ## ## Family Specific Parameters: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## sigma 347.62 4.44 339.20 356.35 1.00 3833 3001 ## ## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1). ``` --- # Distribution of RTs <img src="index_files/figure-html/rt-1-ppc-1.png" height="500px" style="display: block; margin: auto;" /> --- class: inverse center bottom background-image: url("img/chris-robert-unsplash.jpg") # BREAK --- # PP check didn't look good... <img src="index_files/figure-html/rt-1-ppc-2-1.png" height="500px" style="display: block; margin: auto;" /> ??? Why is that? Because our prior beliefs were a bit off... --- # RT values are not Gaussian <br> Reaction times are not Gaussian (i.e. 
they are **not generated by a Gaussian process**). -- There are several proposed distribution families for RTs. -- The most notable: - **Log-normal** distribution. - **ExGaussian** or Exponential-Gaussian distribution. ??? Note that fitting a Gaussian distribution to log-transformed values is equivalent to fitting a log-normal distribution to non-transformed values. --- class: center middle inverse .f1[How do you specify priors on the log scale?] --- # Accuracy and phonemic distance ```r mald ``` ``` ## # A tibble: 3,000 × 7 ## Subject Item IsWord PhonLev RT ACC RT_log ## <chr> <chr> <fct> <dbl> <int> <lgl> <dbl> ## 1 15345 nihnaxr FALSE 5.84 945 TRUE 6.85 ## 2 15345 skaep FALSE 6.03 1046 FALSE 6.95 ## 3 15345 grandparents TRUE 10.3 797 TRUE 6.68 ## 4 15345 sehs FALSE 5.88 2134 TRUE 7.67 ## 5 15345 cousin TRUE 5.78 597 TRUE 6.39 ## 6 15345 blowup TRUE 6.03 716 TRUE 6.57 ## 7 15345 hhehrnzmaxn FALSE 7.30 1985 TRUE 7.59 ## 8 15345 mantic TRUE 6.21 1591 TRUE 7.37 ## 9 15345 notable TRUE 6.82 620 TRUE 6.43 ## 10 15345 prowthihviht FALSE 7.68 1205 TRUE 7.09 ## # … with 2,990 more rows ``` --- layout: true # Accuracy and phonemic distance --- <img src="index_files/figure-html/acc-pd-1.png" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/acc-pd-2-1.png" height="500px" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/acc-pd-3-1.png" height="500px" style="display: block; margin: auto;" /> --- layout: false # Distribution of accuracy data by phonemic distance .f2[ `$$\text{ACC}_i \sim Bernoulli(p)$$` ] -- .f2[ `$$logit(p) = \beta_0 + \beta_1\text{PL}_i$$` ] -- .f2[ `$$\beta_0 \sim Normal(\mu_0, \sigma_0)$$` ] -- .f2[ `$$\beta_1 \sim Normal(\mu_1, \sigma_1)$$` ] -- <br> But we are on the log-odds scale now, because of the `logit(p)` part! -- So priors have to be specified on the log-odds scale. 
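In base R, `qlogis()` (logit) and `plogis()` (inverse logit) convert between probabilities and log-odds, which helps when reasoning about priors on this scale. A quick sketch (not part of the model code):

```r
# probability -> log-odds (logit)
qlogis(0.5)           # 0: a 50% probability is 0 log-odds
qlogis(c(0.1, 0.9))   # equal magnitude, opposite sign
# log-odds -> probability (inverse logit)
plogis(0)             # 0.5
plogis(c(-3, 3))      # roughly 0.05 and 0.95
```

This is also why `Normal(0, 1.5)` is a sensible weakly informative intercept prior: ±2 × 1.5 = ±3 log-odds spans roughly 5%–95% probability.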
--- # Log-odds and probabilities <img src="index_files/figure-html/p-log-odds-1.png" height="500px" style="display: block; margin: auto;" /> --- # Distribution of accuracy data by phonemic distance .f2[ `$$\text{ACC}_i \sim Bernoulli(p)$$` `$$logit(p) = \beta_0 + \beta_1\text{PL}_i$$` `$$\beta_0 \sim Normal(0, 1.5)$$` `$$\beta_1 \sim Normal(0, 1)$$` ] ??? Remember, `Normal(0, 1.5)` corresponds to a 95% CrI of [-3, +3], which in probability terms is approximately [5%, 95%]. `Normal(0, 1)` corresponds to a 95% CrI of [-2, +2], which is approximately [0.14, 7.4] in odds. --- # Accuracy and phonemic distance ```r priors <- c( prior(normal(0, 1.5), class = Intercept), # beta_0 prior(normal(0, 1), class = b, coef = PhonLev) # beta_1 ) acc_1 <- brm( ACC ~ PhonLev, data = mald, family = bernoulli, prior = priors, cores = my_cores, file = "data/cache/acc_1", backend = "cmdstanr" ) ``` --- # Accuracy and phonemic distance ```r conditional_effects(acc_1, "PhonLev") ``` <img src="index_files/figure-html/acc-1-cond-1.png" height="500px" style="display: block; margin: auto;" /> --- # Accuracy and phonemic distance by word type ```r priors <- c( prior(normal(0, 1.5), class = Intercept), # beta_0 prior(normal(0, 1), class = b, coef = PhonLev), # beta_1 prior(normal(0, 1), class = b, coef = IsWordFALSE), # beta_2 prior(normal(0, 1), class = b, coef = `PhonLev:IsWordFALSE`) # beta_3 ) acc_2 <- brm( ACC ~ PhonLev * IsWord, data = mald, family = bernoulli, prior = priors, cores = my_cores, file = "data/cache/acc_2", backend = "cmdstanr" ) ``` --- # Accuracy and phonemic distance by word type ```r conditional_effects(acc_2, "PhonLev:IsWord") ``` <img src="index_files/figure-html/acc-2-cond-1.png" height="500px" style="display: block; margin: auto;" /> --- class: inverse center bottom background-image: url("img/chris-robert-unsplash.jpg") # Questions?