36 What you’ve learned
This chapter summarises the concepts, methods and skills that were covered in the textbook. You can use this as a final check that you didn’t miss anything important. Chapter 37 lists further topics you could read about to continue your learning process, Chapter 38 is an overview of a possible workflow for quantitative studies (note that it covers steps and topics that you have not learned, but references are given when necessary), and finally Chapter 39 concludes the textbook.
36.1 Research methods
Research methods comprise the research process, project management, digital skills, philosophy and ethics.
The research process is made up of context, data acquisition, data analysis and communication.
Empirical research can be placed along two research axes (descriptive/explanatory, exploratory/corroboratory) and it can have one or more research objectives (establish facts, improve fit of framework to facts, compare fit of different frameworks).
The research context spans everything between the more general research topic down to the more specific research questions and hypotheses.
Research questions and hypotheses should be precise and testable.
Quantitative data analysis includes the following main steps: summarising, visualising and modelling.
As part of the research cycle, we are faced with researcher’s degrees of freedom and with Questionable Research Practices. This has led to a series of research crises, related to four concepts of research reliability: reproducibility, replicability, robustness and generalisability.
The Null Ritual has been established as the standard frequentist approach in contemporary research, despite the theoretical pitfalls. P-values are frequently misinterpreted and misused.
Open Research is a movement that stresses the importance of more honest and transparent research by promoting a series of research principles and by warning against common, although not necessarily intentional, questionable practices and misconceptions.
36.2 Statistics and R
R is a statistical programming language; RStudio is an IDE (integrated development environment).
RStudio/Quarto projects allow you to organise your data and code. Quarto documents allow you to include formatted text, code and output in a single file.
The tidyverse packages are a set of R packages for reading, transforming and visualising data.
Packages can be installed into the R package library with install.packages() and can be attached with library().
You can read data with the read_*() functions from the readr and readxl packages, or with readRDS().
Transforming data can be achieved with mutate(), summarise(), and filter() from the dplyr package. Other relevant functions are group_by() and count().
You can plot data with the ggplot2 package, using ggplot(), aes(), and geometries/statistic functions.
Data frames/tibbles can be pivoted with the pivot_longer() and pivot_wider() functions from the tidyr package.
Statistical inference is about learning something about a population through a sample taken from the population.
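As a minimal sketch of the transforming and pivoting functions listed above, the following uses a small made-up tibble. The data, the column names (speaker, condition, rt_ms) and the object names are hypothetical and chosen only for illustration. It assumes the dplyr and tidyr packages are installed and R 4.1 or later (for the native pipe |>).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: reaction times (ms) for two speakers in two conditions
rt <- tibble(
  speaker   = c("s1", "s1", "s2", "s2"),
  condition = c("a", "b", "a", "b"),
  rt_ms     = c(400, 500, 600, 700)
)

# mutate() adds a column, filter() keeps rows,
# group_by() + summarise() aggregate within groups
rt_summary <- rt |>
  mutate(rt_s = rt_ms / 1000) |>
  filter(rt_ms > 400) |>
  group_by(speaker) |>
  summarise(mean_rt_s = mean(rt_s), n = n())

# count() tallies the number of rows per group
rt_counts <- rt |> count(condition)

# pivot_wider() spreads the conditions across columns;
# pivot_longer() brings them back into a single column
rt_wide <- rt |>
  pivot_wider(names_from = condition, values_from = rt_ms)
rt_long <- rt_wide |>
  pivot_longer(cols = c(a, b), names_to = "condition", values_to = "rt_ms")
```

Printing rt_summary, rt_counts, rt_wide and rt_long in turn shows what each step does to the shape of the data.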
Statistics is about uncertainty and variability.
Probability distributions and probability intervals allow us to quantify uncertainty and variability.
Bayesian inference is a statistical inference approach based on the Bayesian interpretation of probability (probability as the degree of belief that an event will happen).
Bayes’ Theorem states that the posterior probability distribution is proportional to the product of the prior probability distribution and the likelihood of the data (i.e. the probability of the data given the values of the model’s parameters).
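The proportionality in Bayes’ Theorem can be illustrated with a grid approximation in base R. This is only a sketch under stated assumptions: the data (7 successes in 10 trials), the flat prior and the 101-point grid are all made up for illustration, and grid approximation is not how brms estimates posteriors.

```r
# Grid of candidate values for a probability parameter p
p_grid <- seq(0, 1, length.out = 101)

# Prior: flat (uniform) over the grid, an assumption made for illustration
prior <- rep(1, length(p_grid))

# Likelihood: probability of the made-up data (7 successes in 10 trials)
# under each candidate value of p
likelihood <- dbinom(7, size = 10, prob = p_grid)

# The posterior is proportional to prior x likelihood;
# dividing by the sum normalises it so that it sums to 1
unnormalised <- prior * likelihood
posterior <- unnormalised / sum(unnormalised)

# Most probable value of p on the grid
p_grid[which.max(posterior)]
```

Replacing the flat prior with a prior that favours some values of p shifts the posterior towards those values, which is the sense in which the posterior combines prior and data.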
Credible Intervals (CrIs) give the range of values of the posterior probability distribution that contains a certain probability. For example, an 80% CrI tells you that there is an 80% probability or confidence that the parameter’s value is within that interval.
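To make the 80% example concrete, here is a small base R sketch. The draws are simulated with rnorm() as a stand-in for posterior draws (they do not come from a fitted model), and the interval is the equal-tailed one obtained from quantiles.

```r
set.seed(1)

# Stand-in for posterior draws of a parameter (simulated, not from a model)
draws <- rnorm(4000, mean = 2, sd = 1)

# An equal-tailed 80% CrI leaves 10% of the draws in each tail
cri_80 <- quantile(draws, probs = c(0.10, 0.90))

# Proportion of draws that fall inside the interval (about 0.8 by construction)
inside <- mean(draws >= cri_80[1] & draws <= cri_80[2])
inside
```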
Regression models are statistical models based on the equation of a line.
The outcome variable of a regression model determines the distribution family of the model:
For a numeric continuous variable that can be negative and positive use a Gaussian distribution.
For a binary categorical variable use a Bernoulli distribution.
For a numeric continuous variable that can only be positive use a log-normal distribution.
The simplest Bayesian Gaussian regression model is of the form:
\[ \begin{aligned} y & \sim Gaussian(\mu, \sigma)\\ \mu & = \beta_0 + \beta_1 \cdot x\\ \end{aligned} \]
The simplest Bayesian Bernoulli regression model is of the form:
\[ \begin{aligned} y & \sim Bernoulli(p)\\ logit(p) & = \beta_0 + \beta_1 \cdot x\\ \end{aligned} \]
The simplest Bayesian log-normal regression model is of the form (\(\mu\) and \(\sigma\) are on the log scale):
\[ \begin{aligned} y & \sim LogNormal(\mu, \sigma)\\ \mu & = \beta_0 + \beta_1 \cdot x\\ \end{aligned} \]
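One way to get a feel for the three model forms above is to simulate data from them in base R. This is a sketch only: the values of b0, b1 and sigma and the sample size are arbitrary, and the object names are hypothetical.

```r
set.seed(42)

n  <- 1000
x  <- rnorm(n)
b0 <- 0.5   # intercept (beta_0), arbitrary illustrative value
b1 <- 1.2   # slope (beta_1), arbitrary illustrative value

# Gaussian: y can take any real value, negative or positive
y_gaussian <- rnorm(n, mean = b0 + b1 * x, sd = 1)

# Bernoulli: the linear predictor is in log-odds,
# so the inverse logit (plogis) turns it into a probability p
p <- plogis(b0 + b1 * x)
y_bernoulli <- rbinom(n, size = 1, prob = p)

# Log-normal: mu and sigma are on the log scale, so y is strictly positive
y_lognormal <- rlnorm(n, meanlog = b0 + b1 * x, sdlog = 1)

range(y_gaussian)
table(y_bernoulli)
range(y_lognormal)
```

The ranges show why the outcome type determines the family: only the Gaussian outcome goes below zero, the Bernoulli outcome is 0 or 1, and the log-normal outcome stays positive.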
Categorical predictors have to be numerically coded. The default coding system in regression models is treatment contrasts, where predictors are coded using dummy variables (0/1). The number of dummy variables is N-1 where N is the number of levels in the predictor.
Interactions between predictors allow you to model the effect of one predictor based on the levels of another. The R syntax for interactions is x + w + x:w or x * w.
Bayesian regression models can be fitted with brm() from the brms package.
brms uses the Markov Chain Monte Carlo algorithm to estimate the values of the model’s parameters. The output of MCMC is a list of MCMC draws. All operations on the model are operations on the MCMC draws (like summarising and plotting).
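The treatment coding and the interaction syntax can be inspected without fitting any model, using model.matrix() from base R. This is a sketch with a hypothetical data frame d containing a three-level predictor x and a two-level predictor w; it shows how R expands formulas in general rather than anything specific to brms.

```r
# Hypothetical data: a three-level and a two-level categorical predictor
d <- expand.grid(
  x = factor(c("a", "b", "c")),
  w = factor(c("no", "yes"))
)

# Treatment contrasts are R's default for unordered factors:
# a predictor with N levels gets N - 1 dummy variables (plus the intercept)
X_main <- model.matrix(~ x, data = d)
colnames(X_main)

# x * w is shorthand for x + w + x:w,
# so both formulas produce the same columns
X_star  <- model.matrix(~ x * w, data = d)
X_colon <- model.matrix(~ x + w + x:w, data = d)
identical(colnames(X_star), colnames(X_colon))
```

The reference level of each predictor (here "a" and "no") has no column of its own: it is represented by the intercept.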
You can extract the draws from the model with as_draws_df(). The output can be transformed, summarised and plotted.
conditional_effects() is a convenience function that plots the expected values of the outcome based on the model’s draws.
The draws of Bernoulli models are in log-odds and log-odds can be transformed into probabilities with the inverse logit (or logistic) function, in R plogis(). Remember to always include b_Intercept when using plogis() (do plogis(b_Intercept) and plogis(b_Intercept + ...) where ... is other coefficients).
You can calculate Credible Intervals with quantile2() from the posterior package. You should calculate CrIs at multiple probability levels (95, 90, 80, 70, 60, …). You can also find the largest CrI that contains only positive or negative values if the 95% CrI spans both positive and negative values.
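The draw-based workflow above can be sketched in base R without fitting a model. Everything here is an assumption for illustration: the data frame stands in for the output of as_draws_df() on a Bernoulli model with one predictor, the column b_x is a hypothetical coefficient name, the values are simulated rather than estimated, and base quantile() is used in place of quantile2() to keep the example free of package dependencies.

```r
set.seed(7)

# Stand-in for as_draws_df() output: simulated log-odds draws
# (b_x is a hypothetical coefficient name)
draws <- data.frame(
  b_Intercept = rnorm(4000, mean = -1, sd = 0.3),
  b_x         = rnorm(4000, mean = 0.8, sd = 0.4)
)

# Always add the intercept before transforming log-odds into probabilities
draws$p_ref <- plogis(draws$b_Intercept)
draws$p_x   <- plogis(draws$b_Intercept + draws$b_x)

# Equal-tailed CrIs of b_x at several probability levels
cri_pct <- c(95, 90, 80, 70, 60)
cris <- t(sapply(cri_pct, function(pct) {
  tail_prob <- (1 - pct / 100) / 2
  quantile(draws$b_x, probs = c(tail_prob, 1 - tail_prob), names = FALSE)
}))
dimnames(cris) <- list(paste0(cri_pct, "%"), c("lower", "upper"))
cris
```

With real draws, quantile2() takes a probs argument in the same way, and the rows of a table like cris can be scanned to find the widest interval that lies entirely on one side of zero.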