class: center, middle, inverse, title-slide

.title[
# Statistics and Quantitative Methods (S1)
]
.subtitle[
## Week 4
]
.author[
### Dr Stefano Coretta
]
.institute[
### University of Edinburgh
]
.date[
### 2022/10/11
]

---
layout: true

# Summary of last week

---

The data:

```
## # A tibble: 1,334 × 5
##    speaker word  vowel v1_duration    f1
##    <chr>   <chr> <chr>       <dbl> <dbl>
##  1 it01    pugu  u            95.2  313.
##  2 it01    pada  a           139.   781.
##  3 it01    poco  o           127.   563.
##  4 it01    pata  a           127.   749.
##  5 it01    boco  o           132.   552.
##  6 it01    podo  o           125.   557.
##  7 it01    boto  o           134.   548.
##  8 it01    paca  a           119.   733.
##  9 it01    bodo  o           130.   554.
## 10 it01    pucu  u            99.1  283.
## # … with 1,324 more rows
```

---

<img src="index_files/figure-html/dur-f1-1.png" height="500px" style="display: block; margin: auto;" />

---

```
## 
## Call:
## lm(formula = v1_duration ~ f1, data = dur_ita_pol)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66.05 -25.16  -6.76  21.46 122.18 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.33909    3.02569   23.25   <2e-16
## f1           0.05915    0.00504   11.74   <2e-16
## 
## Residual standard error: 32.92 on 1332 degrees of freedom
## Multiple R-squared:  0.09373,	Adjusted R-squared:  0.09305
## F-statistic: 137.8 on 1 and 1332 DF,  p-value: < 2.2e-16
```

---

We fitted a linear model to vowel duration with first formant (F1) as the only predictor.

According to the model, the mean vowel duration is 70 ms (SE = 3) when F1 is 0 Hz. For each unit increase of F1, vowels get 0.06 ms longer (SE = 0.005).

In other words, for each 1 Hz increase in F1, there is a 0.06 ms increase in vowel duration. This means that when F1 increases by 100 Hz, vowels are 6 ms longer.

---

- We can fit linear models with the `lm()` function.

- `lm(outcome ~ predictor, data = your_data)`.

- The model `summary()` provides you with the estimates of the model and their standard errors.

- Interpret the estimated coefficients with care!

--

- So far we used **continuous** predictors.

  - `RT ~ PhonLev`: Reaction times as a function of Levenshtein distance.
  - `inmn ~ articulation_rate`: Mean intensity as a function of articulation rate.
  - `v1_duration ~ c2_clos_duration`: Vowel duration as a function of closure duration.

---

.f1[
`$$\text{RT} = \beta_0 + \beta_1\text{PhonLev}$$`
]

<br>

--

.f2[
`$$\beta_0 = \text{intercept}$$`

`$$\beta_1 = \text{coefficient for PhonLev}$$`
]

--

<br>

We know `RT` and `PhonLev` and we need to .green[estimate] `\(\beta_0\)` and `\(\beta_1\)`.

---
layout: false
class: center middle inverse

# What if we have *categorical* predictors?

Like group, condition, gender, L1/L2, treatment, place of articulation...

---
layout: true

# Categorical predictors

---

The data:

```
## # A tibble: 5,000 × 3
##    Subject    RT IsWord
##    <chr>   <int> <fct> 
##  1 15308     617 TRUE  
##  2 15308    1198 FALSE 
##  3 15308     954 TRUE  
##  4 15308     579 TRUE  
##  5 15308    1011 FALSE 
##  6 15308    1402 TRUE  
##  7 15308    1059 FALSE 
##  8 15308     739 TRUE  
##  9 15308     789 TRUE  
## 10 15308     926 FALSE 
## # … with 4,990 more rows
```

---

<img src="index_files/figure-html/rt-plot-1.png" height="500px" style="display: block; margin: auto;" />

---

.f1[
`$$\text{RT} = \beta_0 + \beta_1\text{IsWord}$$`
]

<br>

--

.f2[
`$$\beta_0 = \text{intercept}$$`

`$$\beta_1 = \text{coefficient for IsWord}$$`
]

--

<br>

We know `RT` and `IsWord` (`TRUE` or `FALSE`) and we need to .green[estimate] `\(\beta_0\)` and `\(\beta_1\)`.

---
layout: false
class: bottom
background-image: url(../../img/hannah-grace-fk4tiMlDFF0-unsplash.jpg)
background-size: cover

# .white[But *IsWord* is not numeric!!!]

???

Photo by <a href="https://unsplash.com/@oddityandgrace?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">hannah grace</a> on <a href="https://unsplash.com/s/photos/surprised-puppy?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
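
A quick way to see the problem for yourself (a minimal sketch, assuming the `mald_1_1` data frame used on the following slides):

```r
# `IsWord` is stored as a factor, not as a number,
# so it cannot enter the regression equation as-is.
class(mald_1_1$IsWord)
#> Expected: "factor"
levels(mald_1_1$IsWord)
#> Expected: "TRUE" "FALSE" (the first level is the reference level)
```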

---
layout: true

# Coding categorical predictors

---

Technically, linear models only work with numeric variables. So we can simply **code our categorical predictors using numbers!**

--

Two common types of coding systems:

- **Treatment** coding.

- **Sum** coding. (We'll do this in Week 7!)

--

Note that you don't have to manually code predictors when using `lm()`. R does that for you!

(But understanding how the coding works is important for interpreting the model output.)

???

As with many other things, the naming of coding systems is not settled: the same coding scheme can go by different names and, vice versa, the same name can refer to different schemes. For an excellent overview, see <https://debruine.github.io/faux/articles/contrasts.html>.

---

**Treatment** coding uses `0`s and `1`s to code categorical predictors.

<br>

.f3[
|       | IsWord |
| ----- | -----: |
| TRUE  |      0 |
| FALSE |      1 |
]

---

```
## # A tibble: 5,000 × 5
##    Subject Item           RT IsWord IsWord_coded
##    <chr>   <chr>       <int> <fct>         <dbl>
##  1 15308   acreage       617 TRUE              0
##  2 15308   maxraezaxr   1198 FALSE             1
##  3 15308   prognosis     954 TRUE              0
##  4 15308   giggles       579 TRUE              0
##  5 15308   baazh        1011 FALSE             1
##  6 15308   unflagging   1402 TRUE              0
##  7 15308   ihnpaykaxrz  1059 FALSE             1
##  8 15308   hawk          739 TRUE              0
##  9 15308   assessing     789 TRUE              0
## 10 15308   mehlaxl       926 FALSE             1
## # … with 4,990 more rows
```

**NOTE:** You don't have to do the coding of the `IsWord` column yourself! `lm()` does everything for you.

---

.f3[
|       | IsWord |
| ----- | -----: |
| TRUE  |      0 |
| FALSE |      1 |
]

<br>

`$$\text{RT} = \beta_0 + \beta_1\text{IsWord}$$`

<br>

--

`$$\text{RT}(IsWord = TRUE) = \beta_0 + \beta_1 \cdot 0 = \beta_0$$`

<br>

--

`$$\text{RT}(IsWord = FALSE) = \beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1$$`

---

With treatment coding, the first level in the predictor is the **reference level** (here, it is `TRUE`).

The other level is compared against the reference level (`FALSE` vs `TRUE`).

--

<br>

.f3[
`$$\text{RT} = \beta_0 + \beta_1\text{IsWord}$$`
]

<br>

.f2[
`$$\beta_0 = RT_{\text{IsWord=T}}$$`
]

.f2[
`$$\beta_1 = RT_{\text{IsWord=F}} - RT_{\text{IsWord=T}}$$`
]

???

For now, it is easy because we only have two levels, but we will see an example with three later.

`\(\beta_0\)`, aka the *intercept*, corresponds to the mean Reaction Time when `IsWord = TRUE`, because `TRUE` is the reference level of `IsWord`.

`\(\beta_1\)` corresponds to the **DIFFERENCE** between the *intercept* and the mean Reaction Time when `IsWord = FALSE`. In other words, we are comparing RTs when `IsWord = FALSE` vs when `IsWord = TRUE`.

---

```r
isword_lm <- lm(RT ~ IsWord, data = mald_1_1)
summary(isword_lm)
```

```
## 
## Call:
## lm(formula = RT ~ IsWord, data = mald_1_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -919.07 -195.07  -69.34  104.66 2033.93 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  953.074      6.211  153.44   <2e-16
## IsWordFALSE  116.261      8.839   13.15   <2e-16
## 
## Residual standard error: 312.5 on 4998 degrees of freedom
## Multiple R-squared:  0.03346,	Adjusted R-squared:  0.03326
## F-statistic:   173 on 1 and 4998 DF,  p-value: < 2.2e-16
```

???

So, how do we interpret the output? Easy!

- `(Intercept)`: the mean Reaction Time when `IsWord` is `TRUE`.

- `IsWordFALSE`:

  - The **effect** of `IsWord` on RTs when `IsWord = FALSE`.
  - The **difference** between RTs when `IsWord = FALSE` and `IsWord = TRUE`.
  - What you **need to add to the `(Intercept)`** to get the mean RT value of `IsWord = FALSE`.
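
A quick sanity check you can run yourself (a sketch; it assumes `mald_1_1` and `isword_lm` from this slide, with dplyr loaded): the raw group means should match what the coefficients tell us.

```r
library(dplyr)

# Mean RT by IsWord, computed directly from the data.
mald_1_1 %>%
  group_by(IsWord) %>%
  summarise(mean_RT = mean(RT))

# Expected (approximately):
# IsWord = TRUE  ~ 953 ms              = (Intercept)
# IsWord = FALSE ~ 953 + 116 = 1069 ms = (Intercept) + IsWordFALSE
```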

---

<img src="index_files/figure-html/rt-lm-plot-1.png" height="500px" style="display: block; margin: auto;" />

???

If you are curious about the inner workings of the coding system, try the following code:

```r
mald_coded <- mald_1_1 %>%
  mutate(
    IsWordFALSE = ifelse(IsWord == "TRUE", 0, 1)
  )

lm_2 <- lm(RT ~ IsWordFALSE, data = mald_coded)
summary(lm_2)
```

---
layout: false
layout: true

# Three levels

---

The data:

```
## # A tibble: 879 × 4
##    speaker word  vowel v1_duration
##    <chr>   <chr> <fct>       <dbl>
##  1 it01    pugu  u            95.2
##  2 it01    pada  a           139. 
##  3 it01    poco  o           127. 
##  4 it01    pata  a           127. 
##  5 it01    boco  o           132. 
##  6 it01    podo  o           125. 
##  7 it01    boto  o           134. 
##  8 it01    paca  a           119. 
##  9 it01    bodo  o           130. 
## 10 it01    pucu  u            99.1
## # … with 869 more rows
```

---

<img src="index_files/figure-html/vdur-plot-1.png" height="500px" style="display: block; margin: auto;" />

---

How can we code three levels with just `0`s and `1`s?

.f3[
|      | vowel |
| ---- | ----: |
| a    |     0 |
| o    |     1 |
| u    |   ??? |
]

---

Easy! We cheat the system...

.f3[
|      | vowel_o | vowel_u |
| ---- | ------: | ------: |
| a    |       0 |       0 |
| o    |       1 |       0 |
| u    |       0 |       1 |
]

---

```
## # A tibble: 879 × 6
##    speaker word  vowel vowel_o vowel_u v1_duration
##    <chr>   <chr> <fct>   <dbl>   <dbl>       <dbl>
##  1 it01    pugu  u           0       1        95.2
##  2 it01    pada  a           0       0       139. 
##  3 it01    poco  o           1       0       127. 
##  4 it01    pata  a           0       0       127. 
##  5 it01    boco  o           1       0       132. 
##  6 it01    podo  o           1       0       125. 
##  7 it01    boto  o           1       0       134. 
##  8 it01    paca  a           0       0       119. 
##  9 it01    bodo  o           1       0       130. 
## 10 it01    pucu  u           0       1        99.1
## # … with 869 more rows
```

**NOTE:** You don't have to do the coding of the `vowel` column yourself! `lm()` does everything for you.

---

.f3[
|      | vowel_o | vowel_u |
| ---- | ------: | ------: |
| a    |       0 |       0 |
| o    |       1 |       0 |
| u    |       0 |       1 |
]

<br>

`$$\text{vdur} = \beta_0 + \beta_1\text{vowel_o} + \beta_2\text{vowel_u}$$`

--

<br>

We know `v1_duration` and `vowel` (`a`, `o` or `u`) and we need to .green[estimate] `\(\beta_0\)`, `\(\beta_1\)` and `\(\beta_2\)`.

---

`$$\text{vdur}(vowel = a) = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 = \beta_0$$`

<br>

--

`$$\text{vdur}(vowel = o) = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 0 = \beta_0 + \beta_1$$`

<br>

--

`$$\text{vdur}(vowel = u) = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 1 = \beta_0 + \beta_2$$`
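
???

The output on the next slide comes from a call like the following (a sketch; the object name `vowel_lm` is ours, `dur_ita` is the data frame shown earlier in this section):

```r
vowel_lm <- lm(v1_duration ~ vowel, data = dur_ita)
summary(vowel_lm)

# The predicted vowel durations shown later in this section can be
# obtained with ggpredict() from the ggeffects package:
# ggpredict(vowel_lm)
```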

---

```
## 
## Call:
## lm(formula = v1_duration ~ vowel, data = dur_ita)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.003 -22.543  -2.243  17.069 106.798 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  128.745      1.812   71.03   <2e-16
## vowelo        -5.813      2.561   -2.27   0.0235
## vowelu       -29.891      2.606  -11.47   <2e-16
## 
## Residual standard error: 31.34 on 876 degrees of freedom
## Multiple R-squared:  0.1433,	Adjusted R-squared:  0.1414
## F-statistic: 73.28 on 2 and 876 DF,  p-value: < 2.2e-16
```

---

<img src="index_files/figure-html/vdur-lm-plot-1.png" height="500px" style="display: block; margin: auto;" />

---

```
## # Predicted values of v1_duration
## 
## vowel | Predicted |           95% CI
## ------------------------------------
## a     |    128.74 | [125.19, 132.30]
## o     |    122.93 | [119.39, 126.48]
## u     |     98.85 | [ 95.18, 102.52]
```

---
layout: false
layout: true

# Four levels

---

The data:

```
## # A tibble: 3,268 × 5
##    speaker word  vowel height   voice_duration
##    <chr>   <chr> <fct> <fct>             <dbl>
##  1 it01    poto  o     mid-low           133. 
##  2 it01    topo  o     mid-low           167. 
##  3 it01    pato  a     low               178. 
##  4 it01    teto  e     mid-high          142. 
##  5 it01    toto  o     mid-low           154. 
##  6 it01    puco  u     high               77.0
##  7 it01    chipo i     high               78.7
##  8 it01    peco  e     mid-high          105. 
##  9 it01    poco  o     mid-low           137. 
## 10 it01    poto  o     mid-low           129. 
## # … with 3,258 more rows
```

---

<img src="index_files/figure-html/itaegg-plot-1.png" height="500px" style="display: block; margin: auto;" />

---

You know the drill...

.f3[
|          | height_midlow | height_midhigh | height_high |
|----------| :-----------: | :------------: | :---------: |
| low      |       0       |        0       |      0      |
| mid-low  |       1       |        0       |      0      |
| mid-high |       0       |        1       |      0      |
| high     |       0       |        0       |      1      |
]

---

```
## # A tibble: 3,268 × 7
##    speaker word  vowel height   height_midlow height_midhigh height_high
##    <chr>   <chr> <fct> <fct>            <dbl>          <dbl>       <dbl>
##  1 it01    poto  o     mid-low              1              0           0
##  2 it01    topo  o     mid-low              1              0           0
##  3 it01    pato  a     low                  0              0           0
##  4 it01    teto  e     mid-high             0              1           0
##  5 it01    toto  o     mid-low              1              0           0
##  6 it01    puco  u     high                 0              0           1
##  7 it01    chipo i     high                 0              0           1
##  8 it01    peco  e     mid-high             0              1           0
##  9 it01    poco  o     mid-low              1              0           0
## 10 it01    poto  o     mid-low              1              0           0
## # … with 3,258 more rows
```

**NOTE:** You don't have to do the coding of the `height` column yourself! `lm()` does everything for you.

---

.f3[
|          | height_midlow | height_midhigh | height_high |
|----------| :-----------: | :------------: | :---------: |
| low      |       0       |        0       |      0      |
| mid-low  |       1       |        0       |      0      |
| mid-high |       0       |        1       |      0      |
| high     |       0       |        0       |      1      |
]

<br>

`$$\text{voicedur} = \beta_0 + \beta_1\text{height_midlow} + \beta_2\text{height_midhigh} + \beta_3\text{height_high}$$`

--

<br>

We know `voice_duration` and `height` (`low`, `mid-low`, `mid-high` or `high`) and we need to .green[estimate] `\(\beta_0\)`, `\(\beta_1\)`, `\(\beta_2\)` and `\(\beta_3\)`.
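
???

If you want to peek at the `0`/`1` coding R builds behind the scenes, `contrasts()` shows it. A minimal sketch, assuming `height` in `ita_egg` is a factor with its levels ordered low, mid-low, mid-high, high, as in the model output on the next slide:

```r
contrasts(ita_egg$height)

# Expected (approximately): treatment coding with `low` as the reference level.
#          mid-low mid-high high
# low            0        0    0
# mid-low        1        0    0
# mid-high       0        1    0
# high           0        0    1
```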

---

```
## 
## Call:
## lm(formula = voice_duration ~ height, data = ita_egg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.535 -17.721  -1.852  14.815  89.431 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     125.635      1.091 115.158   <2e-16
## heightmid-low    -3.309      1.493  -2.217   0.0267
## heightmid-high  -16.180      1.534 -10.549   <2e-16
## heighthigh      -26.751      1.313 -20.367   <2e-16
## 
## Residual standard error: 25.28 on 2894 degrees of freedom
##   (370 observations deleted due to missingness)
## Multiple R-squared:  0.1669,	Adjusted R-squared:  0.1661
## F-statistic: 193.3 on 3 and 2894 DF,  p-value: < 2.2e-16
```

---

<img src="index_files/figure-html/itaegg-lm-plot-1.png" height="500px" style="display: block; margin: auto;" />

---

```
## # Predicted values of voice_duration
## 
## height   | Predicted |           95% CI
## ---------------------------------------
## low      |    125.63 | [123.50, 127.77]
## mid-low  |    122.33 | [120.33, 124.32]
## mid-high |    109.46 | [107.34, 111.57]
## high     |     98.88 | [ 97.45, 100.32]
```

---
layout: false
layout: true

# Two categorical predictors

---

The data:

```
## # A tibble: 879 × 5
##    speaker word  vowel c2_phonation v1_duration
##    <chr>   <chr> <fct> <chr>              <dbl>
##  1 it01    pugu  u     voiced              95.2
##  2 it01    pada  a     voiced             139. 
##  3 it01    poco  o     voiceless          127. 
##  4 it01    pata  a     voiceless          127. 
##  5 it01    boco  o     voiceless          132. 
##  6 it01    podo  o     voiced             125. 
##  7 it01    boto  o     voiceless          134. 
##  8 it01    paca  a     voiceless          119. 
##  9 it01    bodo  o     voiced             130. 
## 10 it01    pucu  u     voiceless           99.1
## # … with 869 more rows
```

---

```r
vdur_lm_2 <- lm(v1_duration ~ vowel + c2_phonation, data = dur_ita)
summary(vdur_lm_2)
```

```
## 
## Call:
## lm(formula = v1_duration ~ vowel + c2_phonation, data = dur_ita)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.385 -23.657  -2.486  17.934 113.540 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            135.555      2.051  66.089  < 2e-16
## vowelo                  -5.791      2.501  -2.315   0.0208
## vowelu                 -29.914      2.546 -11.751  < 2e-16
## c2_phonationvoiceless  -13.575      2.065  -6.574 8.41e-11
## 
## Residual standard error: 30.61 on 875 degrees of freedom
## Multiple R-squared:  0.1836,	Adjusted R-squared:  0.1809
## F-statistic: 65.61 on 3 and 875 DF,  p-value: < 2.2e-16
```

---

```r
ggpredict(vdur_lm_2)
```

```
## $vowel
## # Predicted values of v1_duration
## 
## vowel | Predicted |           95% CI
## ------------------------------------
## a     |    121.98 | [117.97, 125.99]
## o     |    116.19 | [112.18, 120.19]
## u     |     92.07 | [ 87.95, 96.18]
## 
## Adjusted for:
## * c2_phonation = voiceless
## 
## $c2_phonation
## # Predicted values of v1_duration
## 
## c2_phonation | Predicted |           95% CI
## -------------------------------------------
## voiced       |    135.56 | [131.53, 139.58]
## voiceless    |    121.98 | [117.97, 125.99]
## 
## Adjusted for:
## * vowel = a
## 
## attr(,"class")
## [1] "ggalleffects" "list"
## attr(,"model.name")
## [1] "vdur_lm_2"
```

---

```r
ggpredict(vdur_lm_2, terms = c("vowel", "c2_phonation"))
```

```
## # Predicted values of v1_duration
## 
## # c2_phonation = voiced
## 
## vowel | Predicted |           95% CI
## ------------------------------------
## a     |    135.56 | [131.53, 139.58]
## o     |    129.76 | [125.75, 133.78]
## u     |    105.64 | [101.52, 109.76]
## 
## # c2_phonation = voiceless
## 
## vowel | Predicted |           95% CI
## ------------------------------------
## a     |    121.98 | [117.97, 125.99]
## o     |    116.19 | [112.18, 120.19]
## u     |     92.07 | [ 87.95, 96.18]
```

---

<img src="index_files/figure-html/vdur-lm-2-plot-1-1.png" height="500px" style="display: block; margin: auto;" />

---

<img src="index_files/figure-html/vdur-lm-2-plot-2-1.png" height="500px" style="display: block; margin: auto;" />
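
???

If you want to reproduce plots along these lines yourself, the ggeffects package provides a `plot()` method for the output of `ggpredict()`. A minimal sketch (not necessarily how the figures in these slides were made):

```r
library(ggeffects)

preds <- ggpredict(vdur_lm_2, terms = c("vowel", "c2_phonation"))
plot(preds)
```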

---
layout: false

# Summary

.pull-left[
* The simplest form of a linear model is the following:
  * `\(y = \beta_0 + \beta_1 x\)`, in R syntax `lm(y ~ x)`.
  * We want to estimate the `\(\beta_n\)` .orange[**coefficients**].

* When `x` and `y` are **both continuous variables**, all good!
  * `\(\beta_0\)` is the intercept.
  * `\(\beta_1\)` is the slope.

* In other words:
  * `\(\beta_0\)` is the mean value of `y` when `x` is 0.
  * `\(\beta_1\)` is what gets added to (or subtracted from) the mean value of `y` for each unit increase of `x` (so going from `x = 0` to `x = 1` takes you from `\(\beta_0\)` to `\(\beta_0 + \beta_1\)`).
]

--

.pull-right[
* When `y` is continuous but `x` is a **discrete variable**:
  * We code discrete predictors as numbers with treatment coding (i.e. with `0`s and `1`s).
  * `\(\beta_0\)` is the mean value of `y` when `x` is at the reference level.
  * `\(\beta_1\)` is what you have to add to (or subtract from) `\(\beta_0\)` when `x` is at the other level.
  * and so on for each `\(\beta_n\)`...
]
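
???

To tie the two columns together, here is a compact check you can run (a sketch; `vdur_lm_2` is the model from the two-predictor section, and the coefficient names are the ones shown in its summary):

```r
b <- coef(vdur_lm_2)

# Mean duration of /a/ followed by a voiced consonant (both reference levels):
b["(Intercept)"]
# roughly 135.6 ms

# Mean duration of /u/ followed by a voiceless consonant:
b["(Intercept)"] + b["vowelu"] + b["c2_phonationvoiceless"]
# roughly 135.6 - 29.9 - 13.6 = 92.1 ms, matching the ggpredict() output above
```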