class: center, middle, inverse, title-slide

.title[
# Statistics and Quantitative Methods (S2)
]
.subtitle[
## Week 2
]
.author[
### Dr Stefano Coretta
]
.institute[
### University of Edinburgh
]
.date[
### 2023/01/24
]

---

class: center middle reverse

# TURN MIC ON!

---

layout: true

## Sample `\(y\)`

---

.center[
![](../../img/inference.png)
]

When we ask a research question, we collect a sample `\(y\)` from a population.

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
`\(y\)` is a sample of values (`\(y_1, y_2, y_3, ..., y_n\)`).
]

--

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
A sample of values can be, e.g.:

- Number of telic and atelic verbs in a historical corpus of Sanskrit.

- Voice Onset Time of stops from 50 speakers of Mapudungun.

- Friendliness ratings of synthetic speech as indicated by 300 participants.

- ...
]

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
`\(y\)` is a sample of values (`\(y_1, y_2, y_3, ..., y_n\)`).
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**We say that the values in the sample `\(y\)` were generated by a (random) variable `\(Y\)`.**
]

---

layout: false
layout: true

## Variable `\(Y\)`

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
`\(Y\)` is a (random) variable that generates the values in the sample `\(y\)`.
]

--

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**A (statistical) variable is any characteristic, number, or quantity that can be measured or counted.**

- When you observe or measure something, you are taking note of the values generated by the variable.

- It's called a variable because it varies (ha!).

- The opposite of a variable is a *constant*.
]

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
`\(Y\)` is a (random) variable that generates the values in the sample `\(y\)`.
]

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
Variables can be, e.g.:

- Token number of telic verbs and atelic verbs in written Sanskrit.

- Voice Onset Time of stops in Mapudungun.

- Friendliness ratings of synthetic speech.

- ...
]

---

layout: false
layout: true

## Types of variables

---

.center[
![:scale 70%](../../img/num-cat.png)
]

---

.center[
![:scale 70%](../../img/cont-discr.png)
]

---

.bg-washed-blue.b--purple.ba.bw2.br3.shadow-5.ph4.mt1[
**Numeric continuous variable**: *between any two values there is an infinite number of values*.

- The variable can take on any positive and negative number, including 0.

- The variable can take on any positive number only.

- **Proportions** and **percentages**: The variable can take on any number between 0 and 1.
]

--

.bg-washed-blue.b--purple.ba.bw2.br3.shadow-5.ph4.mt1[
**Numeric discrete variable**: *between any two consecutive values there are no other values*.

- **Counts**: The variable can take on only positive integer values.
]

--

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt1[
**Categorical (discrete) variable**.

- **Binary** or **dichotomous**: The variable can take only one of two values.

- The variable can take any of three or more values.

- **Ordinal**: The variable can take any of three or more values and the values have a natural order.
]

---

layout: false

# Operationalisation

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
We can **operationalise** something as a numeric or a categorical variable.
]

--

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
Think of ways to operationalise the following:

- Voice Onset Time.

- Friendliness of speech.

- Lexical frequency.

- ...
]

---

# Quick poll

.f3[*Think of ways to operationalise (a) Voice Onset Time, (b) Friendliness of speech, (c) Lexical frequency*]

<br>

.pull-left[
.f3[Join at]

.f1[slido.com]

.f1[\#2920 666]
]

.pull-right[
.center[
![](../../img/QR-SQM-2-Week-2.png)
]
]

???

Slido poll.
<https://app.sli.do/event/s6sFzqiJGew4tBKayTZH4q>

---

layout: true

# Summary measures

---

.center[
![](../../img/data-summ.png)
]

---

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
We can summarise variables using **summary measures**.
]

--

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
There are two types of summary measures.

**Measures of central tendency**

- Measures of central tendency indicate the **typical or central value** of a sample.

**Measures of dispersion**

- Measures of dispersion indicate the **spread or dispersion** of the sample values around the central tendency value.
]

--

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**Always report a measure of central tendency together with a measure of dispersion!**
]

---

layout: false

# Measures of central tendency

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**Mean**

`$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + ... + x_n}{n}$$`
]

--

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**Median**

With the values sorted in ascending order:

`$$\text{if } n \text{ is odd, } x_{\frac{n+1}{2}}$$`

`$$\text{if } n \text{ is even, } \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$$`
]

--

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**Mode**

The most common value.
]

---

# Measures of dispersion

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**Range**

`$$\max(x) - \min(x)$$`

The difference between the largest and smallest value.
]

--

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
**Standard deviation**

`$$\text{SD} = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{(x_1 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n-1}}$$`
]

---

# Mean

Use the mean with **numeric continuous variables**, if:

- The variable can take on any positive and negative number, including 0.

```r
mean(c(-1.12, 0.95, 0.41, -2.1, 0.09))
```

```
## [1] -0.354
```

- The variable can take on any positive number only.

```r
mean(c(0.32, 2.58, 1.5, 0.12, 1.09))
```

```
## [1] 1.122
```

--

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt1[
**Don't use the mean with proportions and percentages!**
]

---

layout: true

# Median

---

Use the median with **numeric continuous and discrete variables**.

```r
median(c(-1.12, 0.95, 0.41, -2.1, 0.09))
```

```
## [1] 0.09
```

```r
sort(c(-1.12, 0.95, 0.41, -2.1, 0.09))
```

```
## [1] -2.10 -1.12  0.09  0.41  0.95
```

```r
median(c(0.32, 2.58, 1.5, 0.12, 1.09))
```

```
## [1] 1.09
```

```r
sort(c(0.32, 2.58, 1.5, 0.12, 1.09))
```

```
## [1] 0.12 0.32 1.09 1.50 2.58
```

---

```r
median(c(4, 6, 3, 9, 7, 15))
```

```
## [1] 6.5
```

```r
sort(c(4, 6, 3, 9, 7, 15))
```

```
## [1]  3  4  6  7  9 15
```

---

```r
median(c(4, 6, 3, 9, 7, 15))
```

```
## [1] 6.5
```

```r
mean(c(4, 6, 3, 9, 7, 15))
```

```
## [1] 7.333333
```

```r
median(c(4, 6, 3, 9, 7, 40))
```

```
## [1] 6.5
```

```r
mean(c(4, 6, 3, 9, 7, 40))
```

```
## [1] 11.5
```

---

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt1[
- **The mean is very sensitive to outliers.**

- The median is **not**.
]

---

layout: false

# Mode

Use the mode with **categorical discrete variables**.

```r
table(c("red", "red", "blue", "yellow", "blue", "green", "red", "yellow"))
```

```
## 
##   blue  green    red yellow 
##      2      1      3      2
```

The mode is the most frequent value: `red`.

--

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt1[
**Likert scales are ordinal (categorical) variables, so the mean and median are not appropriate!**
]

---

# Range

Use the range with any **numeric variable**.

```r
x_1 <- c(-1.12, 0.95, 0.41, -2.1, 0.09)
max(x_1) - min(x_1)
```

```
## [1] 3.05
```

```r
x_2 <- c(0.32, 2.58, 1.5, 0.12, 1.09)
max(x_2) - min(x_2)
```

```
## [1] 2.46
```

```r
x_3 <- c(4, 6, 3, 9, 7, 15)
max(x_3) - min(x_3)
```

```
## [1] 12
```

---

# Standard deviation

Use the standard deviation with **numeric continuous variables**, if:

- The variable can take on any positive and negative number, including 0.

```r
sd(c(-1.12, 0.95, 0.41, -2.1, 0.09))
```

```
## [1] 1.23658
```

- The variable can take on any positive number only.

```r
sd(c(0.32, 2.58, 1.5, 0.12, 1.09))
```

```
## [1] 0.9895555
```

--

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt1[
**Don't use the standard deviation with proportions and percentages!**

Standard deviations are **relative** and depend on the measurement **unit/scale**!
]

---

# Quick poll

.f3[*For which of the following variables is the MEDIAN appropriate?*]

<br>

.pull-left[
.f3[Join at]

.f1[slido.com]

.f1[\#2920 666]
]

.pull-right[
.center[
![](../../img/QR-SQM-2-Week-2.png)
]
]

???

Slido poll.

<https://app.sli.do/event/s6sFzqiJGew4tBKayTZH4q>

---

# Summary measures overview

<br>
<br>
<br>

.center[
![](../../img/measures-overview.png)
]

---

# Summary

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
- The **sample** `\(y\)` is generated by a (random) variable `\(Y\)`.

- A (statistical) **variable** is any characteristic, number, or quantity that can be measured or counted.

- Variables can be **numeric or categorical**.

  - Numeric variables can be continuous or discrete.

  - Categorical variables are only discrete.

- We **operationalise** a measure/observation as a numeric or a categorical variable.

- We summarise variables using **summary measures**:

  - Measures of **central tendency** indicate the typical or central value of a sample.

  - Measures of **dispersion** indicate the spread or dispersion of the sample values around the central tendency value.
]
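
---

# Summary measures by hand

As a rough sketch (not part of the original slides), the summary measures can be computed step by step and compared with R's built-in functions. The variable names (`mean_by_hand`, `median_by_hand`, `sd_by_hand`, `mode_by_hand`, `colours`) are made up for illustration, and the data are reused from the median and mode slides.

```r
# Sample values reused from the median slides (n is even here).
x <- c(4, 6, 3, 9, 7, 15)
n <- length(x)

# Mean: sum of the values divided by n.
mean_by_hand <- sum(x) / n

# Median for even n: average of the two middle values of the sorted sample.
x_sorted <- sort(x)
median_by_hand <- (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2

# Standard deviation: sd() divides the summed squared deviations by n - 1.
sd_by_hand <- sqrt(sum((x - mean_by_hand)^2) / (n - 1))

# Mode: the most frequent value, read off a frequency table.
colours <- c("red", "red", "blue", "yellow", "blue", "green", "red", "yellow")
counts <- table(colours)
mode_by_hand <- names(counts)[counts == max(counts)]

# Each hand-computed value should agree with the built-in function.
all.equal(mean_by_hand, mean(x))
all.equal(median_by_hand, median(x))
all.equal(sd_by_hand, sd(x))
mode_by_hand
```

If two or more values are tied for most frequent, `mode_by_hand` returns all of them, which is one possible way to handle ties.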