class: center, middle, inverse, title-slide

.title[
# Statistics and Quantitative Methods (S1)
]
.subtitle[
## Week 3
]
.author[
### Dr Stefano Coretta
]
.institute[
### University of Edinburgh
]
.date[
### 2022/10/04
]

---

# Slido quiz 1

.f3[*Which of the following is the formula of a straight line?*]

<br>

.pull-left[
.f3[Join at]

.f1[slido.com]

.f1[\#2272 798]
]

.pull-right[
.center[
![](../../img/QR-SQM-1-Week-3.png)
]
]

???

Slido poll.

<https://app.sli.do/event/krtv5FmoQy4ESjd45RhHQF>

---
class: inverse
background-image: url(../../img/time-travel.jpg)
background-size: contain

# Time travel...

---
layout: true

# Given the .green[formula], find the .orange[points]

---

## You have:

.f2[$$x = (2, 4, 5, 8, 10, 23, 36)$$]

.f2[$$y = 3 + 2x$$]

## Find:

.f2[$$y$$]

---

<img src="index_files/figure-html/homework-1.png" height="500px" style="display: block; margin: auto;" />

---

<img src="index_files/figure-html/line-1.png" height="500px" style="display: block; margin: auto;" />

---
layout: false
layout: true

# Given the .orange[points], find the .green[formula]

---

## You have:

.f2[$$x = (2, 4, 5, 8, 10, 23, 36)$$]

.f2[$$y = (8, 14, 17, 26, 32, 71, 110)$$]

## Find:

.f2[$$y = ? + ? x$$]

---

<img src="index_files/figure-html/sample-1.png" height="500px" style="display: block; margin: auto;" />

---
layout: false

# But...

.pull-left[
![](../../img/world-data.jpg)
]

.pull-right[
.f2[Data is an .red[imperfect] representation of the world!]

Measurements are .red[noisy].
]

---

# Given the .orange[points], find the .green[formula]

<img src="index_files/figure-html/sample-noisy-1.png" height="500px" style="display: block; margin: auto;" />

---
class: center middle inverse

# If `\(x\)` changes, what happens to `\(y\)`?

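---

# From formula to points, and back, in R

The two exercises with noise-free points can be sketched in base R. The variable names (`y_from_formula`, `slope`, `intercept`) are illustrative and not part of the original materials, and the second half assumes that all points fall exactly on one line.

```r
x <- c(2, 4, 5, 8, 10, 23, 36)

# Given the formula, find the points: y = 3 + 2x
y_from_formula <- 3 + 2 * x
y_from_formula  # 7 11 13 19 23 49 75

# Given the points, find the formula: y = ? + ? x
y <- c(8, 14, 17, 26, 32, 71, 110)

# With noise-free points, any two of them give the slope (rise over run)
slope <- (y[2] - y[1]) / (x[2] - x[1])

# The intercept is what is left of y once slope * x is taken away
intercept <- y[1] - slope * x[1]

c(intercept = intercept, slope = slope)  # intercept 2, slope 3
```

This shortcut only works because every point sits exactly on the line. With noisy measurements, different pairs of points give different answers, so the intercept and slope have to be estimated.
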
---

# Now enter Linear Models

<div style="width:100%;height:0;padding-bottom:42%;position:relative;"><iframe src="https://giphy.com/embed/3owzVYjZSzuFivWpHi" width="100%" height="100%" style="position:absolute" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>

???

[via GIPHY](https://giphy.com/gifs/starwars-movie-star-wars-3owzVYjZSzuFivWpHi)

---
layout: true

# Linear model: the basics

---

<img src="index_files/figure-html/lm-plot-1.png" height="500px" style="display: block; margin: auto;" />

---

<img src="index_files/figure-html/lm-plot-2-1.png" height="500px" style="display: block; margin: auto;" />

---

.f1[
`$$y = ? + ? x$$`
]

---

.f1[
`$$y = \beta_0 + \beta_1x$$`
]

<br>

--

.f2[
`$$\beta_0 = intercept$$`

`$$\beta_1 = slope$$`
]

--

<br>

We know `\(x\)` and `\(y\)` and we need to .green[estimate] `\(\beta_0\)` and `\(\beta_1\)`.

---
layout: false
class: center middle

<iframe src="https://stefanocoretta.shinyapps.io/lines/" width="1200" height="600"></iframe>

???

<https://stefanocoretta.shinyapps.io/lines/>

---

# Slido quiz 2

.f3[*Which of the following is NOT the formula of a straight line?*]

<br>

.pull-left[
.f3[Join at]

.f1[slido.com]

.f1[\#2272 798]
]

.pull-right[
.center[
![](../../img/QR-SQM-1-Week-3.png)
]
]

???

Slido poll.

<https://app.sli.do/event/krtv5FmoQy4ESjd45RhHQF>

---
layout: true

# Linear model: the basics

---

<img src="index_files/figure-html/sample-2-1.png" height="500px" style="display: block; margin: auto;" />

---

```r
line_lm <- lm(y ~ x, data = line_2)
summary(line_lm)
```

```
##
## Call:
## lm(formula = y ~ x, data = line_2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -25.2197  -5.2243   0.3434   6.2411  20.7275
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.99182    2.66626   0.747    0.459
## x            3.02079    0.04762  63.432   <2e-16
##
## Residual standard error: 10.07 on 48 degrees of freedom
## Multiple R-squared: 0.9882, Adjusted R-squared: 0.988
## F-statistic: 4024 on 1 and 48 DF, p-value: < 2.2e-16
```

---

<img src="index_files/figure-html/line-model-1.png" height="500px" style="display: block; margin: auto;" />

---
layout: false
layout: true

# Standard error

---

```r
summary(line_lm)
```

```
##
## Call:
## lm(formula = y ~ x, data = line_2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -25.2197  -5.2243   0.3434   6.2411  20.7275
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.99182    2.66626   0.747    0.459
## x            3.02079    0.04762  63.432   <2e-16
##
## Residual standard error: 10.07 on 48 degrees of freedom
## Multiple R-squared: 0.9882, Adjusted R-squared: 0.988
## F-statistic: 4024 on 1 and 48 DF, p-value: < 2.2e-16
```

???

The **standard error** is a measure of the (lack of) precision of an estimate.

The greater the standard error, the less precise the estimate.

Standard errors tend to decrease with more data.

---

<img src="index_files/figure-html/se-1.png" height="500px" style="display: block; margin: auto;" />

???

In this plot, each line is one of the possible lines that fit the data well (according to the model).

These lines are a random selection from all the possible lines that are compatible with the estimates and standard errors calculated by the model.

In other words, any of these could be the actual line that "generated" the data. **We cannot be sure because of the error each estimate comes with!**

---

Let's now try with more data (N = 300).

```
##
## Call:
## lm(formula = y ~ x, data = line_3)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.718  -6.762  -0.188   6.479  40.657
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.10313    1.18137    1.78   0.0761
## x            2.99189    0.02079  143.94   <2e-16
##
## Residual standard error: 10.05 on 298 degrees of freedom
## Multiple R-squared: 0.9858, Adjusted R-squared: 0.9858
## F-statistic: 2.072e+04 on 1 and 298 DF, p-value: < 2.2e-16
```

---

<img src="index_files/figure-html/se-2-1.png" height="500px" style="display: block; margin: auto;" />

???

Compare this to the previous plot. Now the lines are much less variable, because the standard errors of the estimates are smaller. (Of course they are: there is much more data to estimate those numbers with!)

---
layout: false
layout: true

# Residuals

---

<img src="index_files/figure-html/line-model-2-1.png" height="500px" style="display: block; margin: auto;" />

---

<img src="index_files/figure-html/line-model-3-1.png" height="500px" style="display: block; margin: auto;" />

???

A **residual** is the difference between the observed *y* value of a data point and the *y* value that the line predicts at that point's *x* value.

One way to estimate linear models is to find the line that minimises the sum of the squared residuals across all data points (this is known as ordinary least squares).

There are other methods, and different R packages use different ones, but for the purposes of applied data analysis it doesn't matter which one is used.

---

<img src="index_files/figure-html/line-model-4-1.png" height="500px" style="display: block; margin: auto;" />

???

See how the residuals here are much larger, because the line is "off" relative to the data points?

---

<img src="index_files/figure-html/line-model-5-1.png" height="500px" style="display: block; margin: auto;" />

???

Another example of a bad "fit" of the line (or model) to the data.
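
---

The idea of a residual can be sketched in a few lines of R. The data set behind the plots is not included here, so the code below simulates a stand-in (`sim`) from an assumed true line `\(y = 2 + 3x\)` plus noise. The seed, the range of `x` and the intercept and slope of the "off" line are arbitrary choices made for illustration.

```r
set.seed(2022)  # arbitrary seed, only for reproducibility

# Simulated stand-in data (assumption: not the data used in the plots)
sim <- data.frame(x = runif(50, min = 0, max = 100))
sim$y <- 2 + 3 * sim$x + rnorm(50, mean = 0, sd = 10)

sim_lm <- lm(y ~ x, data = sim)

# A residual is the observed y minus the y predicted by the line
res_fitted <- sim$y - predict(sim_lm)

# This matches what residuals() extracts from the model
all.equal(unname(res_fitted), unname(residuals(sim_lm)))

# Residuals from a line that is "off" (hypothetical intercept and slope)
res_off <- sim$y - (40 + 2 * sim$x)

# Sum of squared residuals: the smaller, the better the fit
c(fitted = sum(res_fitted^2), off = sum(res_off^2))
```

The fitted line has the smaller sum of squared residuals, because `lm()` chooses the intercept and slope that minimise it.
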