20 Frequentist and Bayesian inference
20.1 Frequentist vs Bayesian probability
The two major approaches to statistical inference are the frequentist and the Bayesian approaches. The difference between them lies in how probabilities are conceived, from a philosophical perspective if you will.
Probabilities in a frequentist framework are about the average occurrence of events in a hypothetical series of repetitions of those events. Imagine you observe a volcano for a long period of time. The number of times the volcano erupts within that time, relative to the length of the observation period, tells us the relative frequency of occurrence of volcanic eruptions. In other words, it tells us their (frequentist) probability.
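To make the long-run view concrete, here is a minimal simulation sketch in Python; the daily eruption probability and the length of the observation period are purely hypothetical choices made for illustration.

```python
# Frequentist probability as long-run relative frequency: simulate daily
# observations of a volcano with a fixed (hypothetical) eruption probability.
import numpy as np

rng = np.random.default_rng(1)
true_p = 0.02     # assumed daily eruption probability (illustrative)
days = 10_000     # hypothetical length of the observation period

eruptions = rng.random(days) < true_p
running_freq = eruptions.cumsum() / np.arange(1, days + 1)

# The relative frequency converges towards the underlying probability.
for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} days: relative frequency = {running_freq[n - 1]:.4f}")
```

The longer the observation period, the closer the relative frequency gets to the underlying probability: this long-run behaviour is what a frequentist probability describes.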
In the Bayesian framework, probabilities are about the level of (un)certainty that an event will occur at any specific time given certain conditions. This is probably the way we normally think about probabilities: as in the weather forecast, if somebody tells you that tomorrow it will rain with a probability of 85%, you intuitively know that rain is very likely, although not certain.
In the context of research, a frequentist probability tells you the probability of obtaining the same result again across an imaginary series of replications of the study that generated that probability. A Bayesian probability, on the other hand, tells you the probability that the result of the study is indeed the true result, given the data.
20.2 Frequentist inference
Most current research is carried out with frequentist methods. This is a historical accident, due both to a misunderstanding of Bayesian statistics (which is, by the way, older than frequentist statistics) and to the fact that frequentist maths was much easier to carry out by hand (personal computers did not exist yet).
The commonly accepted approach to frequentist inference is so-called Null Hypothesis Significance Testing, or NHST. As practised by researchers, the NHST approach is a (sometimes incoherent) mix of the frequentist work of Fisher on the one hand, and of Neyman and Pearson on the other. The inconsistent nature of NHST as practised has led to the elaboration of the concept and label “Null Ritual” (Gigerenzer 2004, 2018; Gigerenzer, Krauss, and Vitouch 2004) and the slogan-titled paper The difference between “significant” and “not significant” is not itself statistically significant (Gelman and Stern 2006). The Null Ritual has been criticised by frequentist and Bayesian statisticians alike and has resulted in the proposal of alternative, stricter versions of NHST, like Statistical Inference as Severe Testing (SIST, Mayo 2018; for a critique see Gelman et al. 2019).
The main tenet of NHST (or rather, of the Null Ritual) is that you set up a null hypothesis and try to reject it. A null hypothesis is, in practice, always a nil hypothesis: in other words, the hypothesis that there is no difference between two estimands (usually the means of two or more groups of interest). Using a variety of numerical techniques, you obtain a p-value, i.e. a frequentist probability. The p-value is used for inference: if it is smaller than a pre-set threshold, you can reject the nil hypothesis; if it is equal to or greater than the threshold, you cannot. p-values are very commonly mistaken for Bayesian probabilities (Cassidy et al. 2019), which results in various misinterpretations of reported results. You will learn about the meaning (and misunderstandings) of p-values in Chapter 30.
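To make the procedure concrete, here is a minimal sketch of a two-group comparison in Python; the simulated data, the choice of Welch’s t-test, and the conventional 0.05 threshold are all illustrative assumptions, not recommendations.

```python
# Minimal NHST example: compare two (simulated) group means with a t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)  # hypothetical group A
group_b = rng.normal(loc=5.5, scale=1.0, size=30)  # hypothetical group B

# Nil hypothesis: the two group means are equal.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05  # conventional significance threshold
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the nil hypothesis.")
else:
    print("Fail to reject the nil hypothesis.")
```

Note that the resulting p-value is a statement about hypothetical replications, not about the probability that either hypothesis is true.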
20.3 Bayesian inference
Bayesian inference approaches are now gaining momentum in many fields, including linguistics. The main advantage of Bayesian inference is that it allows researchers to answer research questions in a more straightforward way, using a more intuitive take on uncertainty and probability than what frequentist methods can offer.
Bayesian inference is based on the concept of updating prior beliefs in light of new data. Given a set of prior probabilities and observations, Bayesian inference allows us to revise those prior probabilities and produce posterior probabilities. This is possible through the Bayesian interpretation of probabilities in the context of Bayes’ Theorem.
In simple conceptual terms, the Bayesian interpretation of Bayes’ Theorem states that the probability of a hypothesis \(h\) given the observed data \(d\) is proportional to the product of the prior probability of \(h\) and the probability of \(d\) given \(h\).
\[ P(h|d) \propto P(h) \cdot P(d|h) \]
The prior probability \(P(h)\) represents the researcher’s prior beliefs about \(h\). These beliefs can be based on expert knowledge, previous studies or mathematical principles. For a more hands-on overview, I recommend Chapter 2 of the Statistical Rethinking textbook.
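As a concrete illustration of the formula above, here is a minimal grid-approximation sketch in Python, in the spirit of the globe-tossing example from Statistical Rethinking; the data (6 events in 9 observations) and the flat prior are illustrative assumptions.

```python
# Grid approximation of a posterior: P(h|d) is proportional to P(h) * P(d|h).
import numpy as np

# Hypothetical data: 6 events observed in 9 observation periods.
n_obs, n_events = 9, 6

grid = np.linspace(0, 1, 1000)   # candidate values for the event probability
prior = np.ones_like(grid)       # flat prior: all values equally plausible
likelihood = grid**n_events * (1 - grid)**(n_obs - n_events)  # binomial kernel

posterior = prior * likelihood   # prior times likelihood...
posterior /= posterior.sum()     # ...normalised so the probabilities sum to 1

print(f"Most plausible value: {grid[posterior.argmax()]:.2f}")
```

Replacing the flat prior with a more informative one (for example, one concentrated on small probabilities) would pull the posterior towards those prior beliefs, which is exactly the updating behaviour described above.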
The following section goes through a few reasons to use Bayesian inference for research. Note that some of these reasons presuppose an understanding of regression modelling, both frequentist and Bayesian, so don’t worry if they are not clear yet.
20.4 Why Bayesian inference?
Here are a few practical and conceptual reasons why you should consider switching to Bayesian statistics for your research.
Of course, there are merits in fitting frequentist models, for example in corporate decision-making, but even then you will still have to do a lot of work. The main conceptual difference, then, is that frequentist and Bayesian regression models answer very different questions, and as (basic) researchers we are generally interested in questions that the latter can answer and the former cannot.