class: center, middle, inverse, title-slide

.title[
# Statistics and Quantitative Methods (S1)
]
.subtitle[
## Week 5
]
.author[
### Dr Stefano Coretta
]
.institute[
### University of Edinburgh
]
.date[
### 2022/10/08
]

---
class: center middle inverse

# SUMMARY

---

# Summary

* The simplest .orange[**linear model**] is a straight line.

`$$y = \beta_0 + \beta_1 x$$`

* We want to estimate the `\(\beta_n\)` .orange[**coefficients**].

`$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$`

* Categorical predictors are .orange[coded as numbers].

* The default coding system ("treatment contrasts") sets the intercept as the mean of the first level (the "reference level").

* The other levels of the categorical predictor are compared to the reference level.

---
class: center middle

![:scale 30%](../../img/charlesdeluvio-D44HIk-qsvI-unsplash.jpg)

???

Photo by <a href="https://unsplash.com/@charlesdeluvio?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">charlesdeluvio</a> on <a href="https://unsplash.com/s/photos/sad-puppy?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

---
layout: true

# Exercise 1

---

```
## 
## Call:
## lm(formula = articulation_rate ~ attitude + musicstudent, data = polite)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2246 -0.6739 -0.1175  0.6190  5.9630 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)       7.0522     0.1219  57.874  < 2e-16
## attitudepol      -0.4088     0.1511  -2.705  0.00737
## musicstudentyes  -0.4470     0.1561  -2.864  0.00459
## 
## Residual standard error: 1.131 on 221 degrees of freedom
## Multiple R-squared:  0.06561, Adjusted R-squared:  0.05715 
## F-statistic: 7.759 on 2 and 221 DF,  p-value: 0.0005538
```

---

<img src="index_files/figure-html/art-rate-plot-1.png" height="500px" style="display: block; margin: auto;" />

---
layout: false
layout: true

# Exercise 2

---

```
## 
## Call:
## lm(formula = f0mn ~ attitude + months_ger + gender, data = polite)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -121.994  -26.368   -5.904   20.204  163.443 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  259.52250    5.07675  51.120   <2e-16
## attitudepol  -14.71441    5.31042  -2.771   0.0061
## months_ger    -0.07929    0.04427  -1.791   0.0747
## genderM     -119.08249    5.61931 -21.192   <2e-16
## 
## Residual standard error: 38.65 on 208 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.6965, Adjusted R-squared:  0.6921 
## F-statistic: 159.1 on 3 and 208 DF,  p-value: < 2.2e-16
```

---

<img src="index_files/figure-html/f0-plot-1.png" height="500px" style="display: block; margin: auto;" />

---
layout: false
class: center middle

![:scale 80%](../../img/joe-caione-qO-PIF84Vxg-unsplash.jpg)

???

Photo by <a href="https://unsplash.com/@joeyc?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Joe Caione</a> on <a href="https://unsplash.com/s/photos/happy-puppy?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

---
layout: true

# Shallow Morphological Processing

---

- English L1 and L2 speakers (L2 speakers are native speakers of Cantonese).

- Lexical decision task (Word vs Non-Word).

- Target: *unkindness* ([[un]-[kind]]-ness).

- Primes: *prolong* (Unrelated), *unkind* (Constituent), *kindness* (Non-Constituent).
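---

The data are in a tibble called `shallow` (shown on the next slide). A minimal sketch of how one might check the levels of the prime-target relation, assuming the data are already loaded:

```r
library(dplyr)

# List the distinct values of Relation_type
distinct(shallow, Relation_type)
```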
---

```
## # A tibble: 1,950 × 3
##    ID    accuracy  Relation_type 
##    <chr> <fct>     <fct>         
##  1 L1_01 correct   Unrelated     
##  2 L1_01 correct   Constituent   
##  3 L1_01 correct   Unrelated     
##  4 L1_01 correct   Constituent   
##  5 L1_01 incorrect Unrelated     
##  6 L1_01 correct   Unrelated     
##  7 L1_01 correct   Constituent   
##  8 L1_01 correct   NonConstituent
##  9 L1_01 correct   NonConstituent
## 10 L1_01 correct   Constituent   
## # … with 1,940 more rows
```

---
layout: false

# This doesn't work!

```r
shallow_lm <- lm(accuracy ~ Word_Nonword, data = shallow)
```

```
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
```

--

<br>

What is the difference between this model and the models we have run so far?

(**NOTE:** the error message is absolutely NOT helpful!)

--

<br>

**Hint:** Compare the outcome variables.

---
class: center middle inverse

# PROBABILITY DISTRIBUTIONS

---
class: middle

.pull-left[
![](../../img/06-rainy-umbrella-humid.svg)
]

.pull-right[
![](../../img/01-sunny.svg)
]

???

We are faced every day with probabilities. Just think about the weather forecast. We say things like "there is a 70% probability that it will rain today". In this sense, probability is the probability of an event occurring.

But what about more complex situations that are not a simple flip-of-a-coin situation? For example, what about rolling two dice? Here is where probability distributions come in.

---

# Grubabilities

.center[
![:scale 90%](../../img/grubabilities.png)
]

???

A probability distribution is a list of values and their corresponding probabilities.

---

# Discrete and continuous

.center[
![:scale 80%](../../img/discr-cont-probs.png)
]

???

Remember we talked about continuous and discrete variables? This distinction is helpful not only when deciding which type of plot to use, but also which type of linear model to use! Or, more specifically, which probability distribution to use for the outcome variable. This week's classes will be about this!

Depending on the nature of the values a variable can take, there are two types of probability distributions.

---

# Discrete probability distributions

.center[
![:scale 60%](../../img/dice.png)
]

???

A discrete probability distribution is like counting how many ways you can get a particular value. For example, if you roll a white and a black die, there are 3 ways to get a 4 or a 10, but 6 ways to get a 7.

---
layout: true

# Continuous probability distributions

---

<img src="index_files/figure-html/cont-p-1.png" height="500px" style="display: block; margin: auto;" />

???

With continuous probabilities we cannot make a list of all the possible values (0.0, 0.00, 0.000, 0.0001...), because there is an infinite number of possible values. So we cannot assign a probability to a specific value. Instead, we assign probabilities to a range of values.

---

<img src="index_files/figure-html/cont-p-2-1.png" height="500px" style="display: block; margin: auto;" />

???

In this example, we want to know the probability of observing an f0 value between 0 and 160 Hz, assuming the probability distribution represented in the graph. We simply calculate the area under the curve between those two values (note that the total area under the curve is 1). The probability of f0 being less than 160 Hz is 0.212.

---

<img src="index_files/figure-html/cont-p-3-1.png" height="500px" style="display: block; margin: auto;" />

???

The probability of f0 being greater than 220 Hz is 0.345.
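---

These areas can be computed in R with `pnorm()`. A minimal sketch, assuming the f0 distribution in the plots is Normal with mean 200 Hz and SD 50 Hz (the distribution we write down with a formula later in the deck):

```r
# P(f0 < 160) under a Normal(200, 50) distribution
pnorm(160, mean = 200, sd = 50)
```

```
## [1] 0.2118554
```

```r
# P(f0 > 220): the upper tail of the same distribution
pnorm(220, mean = 200, sd = 50, lower.tail = FALSE)
```

```
## [1] 0.3445783
```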
--- <img src="index_files/figure-html/cont-p-4-1.png" height="500px" style="display: block; margin: auto;" /> ??? The probability of f0 being between 120 and 210 Hz is 0.524 --- But how do we describe probability distributions in a succinct way? We can't make a list of all values and probabilities, especially for continuous probabilities. -- .bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt5[ Instead, we specify the value of the **parameters that describe the distribution**. ] --- layout: false class: center middle .f1.link.dim.br3.ph3.pv2.mb2.dib.white.bg-purple[ [.white[Web App: Probability distributions]](https://seeing-theory.brown.edu/probability-distributions/index.html#section2) ] <!-- <iframe src="https://seeing-theory.brown.edu/probability-distributions/index.html#section2" style="border:none;" width="100%" height="100%"> --> --- class: middle <span style="font-size:3.5em;">$$y_i \sim Normal(\mu, \sigma)$$</span> ??? Let's look at some formulae. This is the formula of a variable `\(y_1\)` that is distributed according to (~) a Normal probability distribution. As we have seen in the example above, a Normal distribution can be described with two parameters: the mean and the standard deviation. --- class: middle <span style="font-size:3.5em;">$$\text{f0}_i \sim Normal(200, 50)$$</span> ??? Remember the example above of a Gaussian/Normal distribution of f0? We can describe that distribution with this formula (much easier than listing all the values and their probability). --- class: center middle inverse # Think about the probability distribution of the outcome variable --- # Common probability distributions .pull-left[ **Continuous outcome variable** * The variable can take on *any positive and negative real number, including 0*: **Gaussian** (aka normal) distribution. * There are very few truly Gaussian variables, although in some cases one can speak of "approximate" or "assumed" normality. * This distribution family is fitted by default in `lm(...)`. ] -- .pull-right[ **Discrete outcome variable** * The variable is *dichotomous*, i.e. it can take one of two levels: **Bernoulli** distribution. * Categorical outcome variables like yes/no, correct/incorrect, voiced/voiceless, follow this distribution. * This family is fitted when you run `glm(..., family = binomial)`, aka "logistic regression" or "binomial regression". * The variable is *counts*: **Poisson** distribution. * Counts of words, segments, gestures, f0 peaks, ... * This family can be fitted with `glm(..., family = poisson)`. ] ??? Note that `glm()` stands for Generalised Linear Model. It's called "generalised" because the maths behind it generalises the use of linear models with the Gaussian distribution family to other distribution families. But from a practical point of view, these are all linear models, whether you fit them with `lm()` or `glm()`. --- # So far we used Gaussian distributions `lm()` uses a Gaussian distribution by default. -- But most variables are not Gaussian (in fact, there aren't any truly Gaussian variables, so much for the "normal" distribution). 
---
class: center middle inverse

# DICHOTOMOUS VARIABLES: ACCURACY

---
layout: true

# Dichotomous variables: Accuracy

---

```
## # A tibble: 1,950 × 3
##    ID    accuracy  Relation_type 
##    <chr> <fct>     <fct>         
##  1 L1_01 correct   Unrelated     
##  2 L1_01 correct   Constituent   
##  3 L1_01 correct   Unrelated     
##  4 L1_01 correct   Constituent   
##  5 L1_01 incorrect Unrelated     
##  6 L1_01 correct   Unrelated     
##  7 L1_01 correct   Constituent   
##  8 L1_01 correct   NonConstituent
##  9 L1_01 correct   NonConstituent
## 10 L1_01 correct   Constituent   
## # … with 1,940 more rows
```

---

<img src="index_files/figure-html/shallow-plot-1.png" height="500px" style="display: block; margin: auto;" />

???

You can clearly see that `"correct"` is much more frequent than `"incorrect"`. In other words, the probability of getting `"correct"` is greater than the probability of getting `"incorrect"`.

---

<img src="index_files/figure-html/shallow-plot-1-1.png" height="500px" style="display: block; margin: auto;" />

???

What happens if we separate by `Relation_type`? You can still see that `"correct"` is more frequent than `"incorrect"` in all relation types, but since each type has a different number of observations it is not easy to compare *across* types.

---

<img src="index_files/figure-html/shallow-plot-2-1.png" height="500px" style="display: block; margin: auto;" />

???

In this plot, we use `position = "fill"` to plot *proportions* rather than raw counts. Now you can see that, proportionally, there are more correct responses in `Constituent` than in `NonConstituent`, and that `Unrelated` sits roughly mid-way between the other two.

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt5[
When the outcome variable is **dichotomous**, we need to estimate the **probability** of getting either level in the variable.
]

--

In the `shallow` data:

- `accuracy` is dichotomous with levels `"incorrect"` and `"correct"`.

- `Relation_type` is discrete with three levels: `Unrelated`, `Constituent`, `NonConstituent`.

So we want to know (i.e. estimate) the probability of getting a `"correct"` response depending on `Relation_type`.

We need to use `glm()` and `family = binomial()`!

---

```
## 
## Call:
## glm(formula = accuracy ~ Relation_type, family = binomial(), 
##     data = shallow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0269   0.5238   0.5238   0.6438   0.7518  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)
## (Intercept)                   1.4684     0.0918  15.997  < 2e-16
## Relation_typeConstituent      0.4485     0.1411   3.179  0.00148
## Relation_typeNonConstituent  -0.3492     0.1492  -2.341  0.01921
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1810.9  on 1949  degrees of freedom
## Residual deviance: 1784.8  on 1947  degrees of freedom
## AIC: 1790.8
## 
## Number of Fisher Scoring iterations: 4
```

???

Here's the summary of `shallow_lm_1`. Do you notice something weird?

---
layout: false
class: middle center inverse

# What is the unit of the estimates?!

???

The estimates should be probabilities, because we are estimating probabilities. But they cannot be probabilities, because probabilities are bounded between 0 and 1.

---
class: middle center

# Probabilities as log-odds

???

Linear models cannot work with probabilities! So we need to transform probabilities into something the model can work with. And that something is log-odds!

---
layout: true

# Probabilities as log-odds

---

<img src="index_files/figure-html/p-log-odds-1.png" height="500px" style="display: block; margin: auto;" />

---

To transform log-odds to probabilities, you can use the `plogis()` function!
```r
# What is the probability at -1 log-odds?
plogis(-1)
```

```
## [1] 0.2689414
```

```r
# What is the probability at 0 log-odds?
plogis(0)
```

```
## [1] 0.5
```

```r
# What is the probability at 1 log-odds?
plogis(1)
```

```
## [1] 0.7310586
```

--

<br>

Now try different numbers with `plogis()`.

---
layout: false
layout: true

# Dichotomous variables: Accuracy

---

```
## 
## Call:
## glm(formula = accuracy ~ Relation_type, family = binomial(), 
##     data = shallow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0269   0.5238   0.5238   0.6438   0.7518  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)
## (Intercept)                   1.4684     0.0918  15.997  < 2e-16
## Relation_typeConstituent      0.4485     0.1411   3.179  0.00148
## Relation_typeNonConstituent  -0.3492     0.1492  -2.341  0.01921
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1810.9  on 1949  degrees of freedom
## Residual deviance: 1784.8  on 1947  degrees of freedom
## AIC: 1790.8
## 
## Number of Fisher Scoring iterations: 4
```

---

```
## # A tibble: 3 × 5
##   term                        estimate std.error statistic  p.value
##   <chr>                          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                    1.47     0.0918     16.0  1.35e-57
## 2 Relation_typeConstituent       0.448    0.141       3.18 1.48e- 3
## 3 Relation_typeNonConstituent   -0.349    0.149      -2.34 1.92e- 2
```

--

- `Intercept`
  - log-odds = 1.4684
  - `plogis(1.4684)` = 0.81
  - i.e. an 81% probability of getting `"correct"` *when* `Relation_type` is `"Unrelated"`.

--

- What about `Relation_typeConstituent` and `Relation_typeNonConstituent`?

<br>

.f3[As with the estimates of discrete predictors in `lm()`, these tell you the difference between the `Intercept` and the predictor level, **but in log-odds** rather than probabilities.]

---

```
## # A tibble: 3 × 5
##   term                        estimate std.error statistic  p.value
##   <chr>                          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                    1.47     0.0918     16.0  1.35e-57
## 2 Relation_typeConstituent       0.448    0.141       3.18 1.48e- 3
## 3 Relation_typeNonConstituent   -0.349    0.149      -2.34 1.92e- 2
```

We need to **add** the estimate to the intercept to calculate probabilities!

- `Relation_typeConstituent`
  - log-odds = 0.4485.
  - `plogis(1.4684 + 0.4485)` = 0.87
  - i.e. when `Relation_type` is `Constituent`, we go from 81% to 87% probability of getting `"correct"`.
  - In other words, there is a probability increase of about 6 percentage points.

--

- `Relation_typeNonConstituent`
  - log-odds = -0.3492.
  - `plogis(1.4684 + -0.3492)` = 0.75
  - i.e. when `Relation_type` is `NonConstituent`, we go from 81% to 75% probability of getting `"correct"`.
  - In other words, there is a probability decrease of about 6 percentage points.

---

```r
ggpredict(shallow_lm_1, terms = "Relation_type")
```

```
## # Predicted probabilities of accuracy
## 
## Relation_type  | Predicted |       95% CI
## -----------------------------------------
## Unrelated      |      0.81 | [0.78, 0.84]
## Constituent    |      0.87 | [0.85, 0.89]
## NonConstituent |      0.75 | [0.71, 0.79]
```

---

<img src="index_files/figure-html/shallow-lm-1-plot-1.png" height="500px" style="display: block; margin: auto;" />

---

```
## 
## Call:
## glm(formula = accuracy ~ Relation_type + Group, family = binomial(), 
##     data = shallow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0946   0.4863   0.5527   0.6784   0.7909  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)
## (Intercept)                   1.6259     0.1173  13.855  < 2e-16
## Relation_typeConstituent      0.4495     0.1412   3.183  0.00146
## Relation_typeNonConstituent  -0.3503     0.1494  -2.345  0.01903
## GroupL2                      -0.2738     0.1224  -2.237  0.02526
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1810.9  on 1949  degrees of freedom
## Residual deviance: 1779.7  on 1946  degrees of freedom
## AIC: 1787.7
## 
## Number of Fisher Scoring iterations: 4
```

---

```
## # Predicted probabilities of accuracy
## 
## # Group = L1
## 
## Relation_type  | Predicted |       95% CI
## -----------------------------------------
## Unrelated      |      0.84 | [0.80, 0.86]
## Constituent    |      0.89 | [0.86, 0.91]
## NonConstituent |      0.78 | [0.73, 0.82]
## 
## # Group = L2
## 
## Relation_type  | Predicted |       95% CI
## -----------------------------------------
## Unrelated      |      0.79 | [0.76, 0.83]
## Constituent    |      0.86 | [0.83, 0.88]
## NonConstituent |      0.73 | [0.68, 0.78]
```

---

<img src="index_files/figure-html/shallow-lm-2-pred-1.png" height="500px" style="display: block; margin: auto;" />

---
layout: false
class: center middle inverse

# COUNTS: CONTINGENT TALKS

---
layout: true

# Counts: Contingent talks

---

```
## # A tibble: 1,620 × 4
##    dyad  background count    ct
##    <chr> <chr>      <dbl> <dbl>
##  1 b01   Bengali        5     1
##  2 b01   Bengali        0     0
##  3 b01   Bengali        0     0
##  4 b01   Bengali        0     0
##  5 b01   Bengali        0     0
##  6 b01   Bengali        0     0
##  7 b01   Bengali        0     0
##  8 b01   Bengali        0     0
##  9 b01   Bengali        0     0
## 10 b01   Bengali        8     3
## # … with 1,610 more rows
```

---

<img src="index_files/figure-html/gest-plot-1.png" height="500px" style="display: block; margin: auto;" />

---

```
## 
## Call:
## glm(formula = ct ~ count, family = poisson(), data = gestures)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4755  -0.5329  -0.5329  -0.5329   6.0319  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.951931   0.062940  -31.01   <2e-16
## count        0.133939   0.002916   45.94   <2e-16
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2265.1  on 1592  degrees of freedom
## Residual deviance: 1266.7  on 1591  degrees of freedom
##   (27 observations deleted due to missingness)
## AIC: 1683.4
## 
## Number of Fisher Scoring iterations: 6
```

---
layout: false
class: middle center inverse

# What is the unit of the estimates?!

???

The estimates should be counts, because we are estimating numbers of contingent talks. But they cannot be counts, because counts are discrete and cannot be negative.

---
class: middle center

# Counts as log-counts

???

Linear models don't work well with counts (because counts are discrete and cannot be negative)! So we need to transform counts into something the model can work with. And that something is log-counts: as with probabilities and log-odds, we move to the log scale. How convenient!

---
layout: true

# Counts as log-counts

---

To transform log-counts back into counts, you can use the `exp()` function.

```r
exp(-1)
```

```
## [1] 0.3678794
```

```r
exp(0)
```

```
## [1] 1
```

```r
exp(1)
```

```
## [1] 2.718282
```

--

<br>

Note that `log()` is the inverse of `exp()`. For example: `log(1)` = 0 and `exp(0)` = 1.

---
layout: false
layout: true

# Counts: Contingent talks

---

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   -1.95    0.0629      -31.0 3.65e-211
## 2 count          0.134   0.00292      45.9 0        
```

--

- `Intercept`
  - log-count = -1.9519
  - `exp(-1.9519)` = 0.14.
  - i.e. *when* gesture `count` is `0`, there are between 0 and 1 contingent talks.

--

- What about gesture `count`?
<br> .f4[As with the estimates of continuous predictors in `lm()` these tell you the difference in `Intercept` when `count` goes from 0 to 1, **but in log-odds** rather than counts.] -- .f3[When dealing with counts, we normally talk about effects as **rate of change**, aka **odd ratios**.] ??? So we need to transform log-odds into odd ratios (aka simply odds). How do we do that? Easy! We use the `exp()` function. So `exp()` converts log-odds into counts and into odd ratios. --- <img src="index_files/figure-html/log-odds-1.png" height="500px" style="display: block; margin: auto;" /> ??? When the odds are 1, then it means that there is no change. Think about this this way: If you start off with £5 and every day your savings change by a factor of 1, then every day you will still have those £5. Because `5 * 1 = 5`. But if every day your savings increase by a factor of 1.5, then after the first day you have £7.5 (`5 * 1.5 = 7.5`), after the second day you have £11.25 (`7.5 * 1.5 = 11.25`), and so on. You see that the rate of change is always the same (1.5, or 50%), but the absolute change depends on the day's savings: - first day: +2.5 - second day: +3.75 - third day: +5.62 - ... --- ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -1.95 0.0629 -31.0 3.65e-211 ## 2 count 0.134 0.00292 45.9 0 ``` - `count` - log-odds = 0.133939. - `exp(0.133939)` = 1.14 - i.e. for each unit increase of `count`, coontingent talks increase by a factor of 1.14. - In other words, for every one extra gesture, there is a 14% increase in contingent talks. --- <img src="index_files/figure-html/ct-2-plot-1.png" height="500px" style="display: block; margin: auto;" /> --- ``` ## ## Call: ## glm(formula = ct ~ count + background, family = poisson(), data = gestures) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -3.6725 -0.5924 -0.5744 -0.4268 5.8259 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -2.39622 0.12374 -19.365 < 2e-16 ## count 0.13032 0.00325 40.097 < 2e-16 ## backgroundChinese 0.65593 0.14901 4.402 1.07e-05 ## backgroundEnglish 0.59421 0.15219 3.904 9.45e-05 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 2265.1 on 1592 degrees of freedom ## Residual deviance: 1243.6 on 1589 degrees of freedom ## (27 observations deleted due to missingness) ## AIC: 1664.2 ## ## Number of Fisher Scoring iterations: 6 ``` --- ```r tidy(ct_lm_3, exponentiate = FALSE) ``` ``` ## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -2.40 0.124 -19.4 1.52e-83 ## 2 count 0.130 0.00325 40.1 0 ## 3 backgroundChinese 0.656 0.149 4.40 1.07e- 5 ## 4 backgroundEnglish 0.594 0.152 3.90 9.45e- 5 ``` --- ```r tidy(ct_lm_3, exponentiate = TRUE) ``` ``` ## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 0.0911 0.124 -19.4 1.52e-83 ## 2 count 1.14 0.00325 40.1 0 ## 3 backgroundChinese 1.93 0.149 4.40 1.07e- 5 ## 4 backgroundEnglish 1.81 0.152 3.90 9.45e- 5 ``` --- <img src="index_files/figure-html/ct-3-plot-1.png" height="500px" style="display: block; margin: auto;" /> ??? - 10 mo infants perform on average 10 iconic gestures per day. At 11 mo, the number of gestures increases by a factor of 1.6. At 12 mo, there is a further increase by a factor of 1.2. -- - Calculate the average number of gestures per day at 12 months based on the 10 month average (10 gestures). 
--

- The average number of errors L2 learners make decreases by a factor of 0.2 every year and a half.

--

- Calculate the average number of errors after 6 years, assuming 265 errors at year 1.
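--

A sketch of the arithmetic in R, under one reading of the exercise (6 years elapsed = four 1.5-year steps):

```r
# Start from 265 errors and apply four multiplicative steps of 0.2
265 * 0.2^4
```

```
## [1] 0.424
```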