Imbuing numbers with meaning is a good characterisation of the inference process. Here is how it works. We have a question about something. Let's imagine that this something is the population of British Sign Language (BSL) signers. We want to know whether the cultural background of BSL signers is linked to different pragmatic uses of the sign for BROTHER. But we can't survey the entire population of BSL signers, so we take a sample from that population instead. The sample is our data (the product of our study or observation). Now, how do we go from data to answering our question about the use of BROTHER? We can use the inference process!
The figure below is a schematic representation of the inference process.
The inference process has two main stages: producing data and inference. For the first step, producing data, we start off with a population. Note that a population can be a set of anything, not just a specific group of people: for example, the words in a dictionary can be a “population”, or the antipassive constructions of Austronesian languages, and so on. From that population we select a sample, and that sample produces our data. We analyse the data to get results. Finally, we use inference to understand something about the population based on the results from the sampled data. Inference can take many forms; the type of inference we are interested in here is statistical inference, i.e. inference carried out with statistics.
However, despite inference being based on data, it does not guarantee that the answers to our questions are right or even that they are true. In fact, any observation we make comes with a certain degree of uncertainty and variability.
Pliny the Elder was a Roman author and naturalist who died in the eruption of Vesuvius in 79 CE. He certainly did not expect to die then. Leaving dark irony aside, as researchers we have to deal with uncertainty and variability.
So uncertainty is a feature of each measurement, while variability arises between different measurements. Together, uncertainty and variability make the inference process more complex and can interfere with its outcomes.
The following picture is a reconstruction of what Galileo Galilei saw when he pointed one of his first telescopes towards Saturn, based on his 1610 sketch: a blurry circle flanked by two smaller blurry circles.
Only six years later, telescopes were much better and Galileo could correctly identify that the flanking circles were not spheres orbiting around Saturn, but rings. The figures below show how the sketches evolved over time between first observation and publication.
The moral of the story is that at any point in history we are like Galileo in at least some of our research: we might be close to understanding something but not quite yet.
To give a more concrete example of how each sample is but an imperfect representation of the population it is taken from, let's look at reaction time (RT) data from the MALD dataset (Tucker et al. 2019). The study the data comes from used an auditory lexical decision task to elicit RTs and accuracy data. In each trial, participants listened to a word and pressed a button to indicate whether the word was a real English word or not. The RT is the time lag between the offset of the auditory stimulus (the target word) and the button press. Note that, to keep things manageable, the data we will read is just a subset of the full data. Figure 6.1 shows a density plot of the data: the x-axis is the range of RTs (in logged milliseconds), while the y-axis shows the “density” of the data. Higher density means that the data contains a lot of observations in that region; conversely, low density means that the data contains few observations in that range. You will learn more about density plots in Chapter 18.
```r
# The tidyverse packages provide filter() and ggplot().
library(tidyverse)

mald <- readRDS("data/tucker2019/mald_1_1.rds")

mald |>
  filter(RT_log > 6) |>
  ggplot(aes(RT_log)) +
  geom_density(fill = "purple", alpha = 0.5) +
  geom_rug(alpha = 0.1) +
  labs(x = "Reaction Times (logged ms)")
```
The mean logged RT is 6.88 and the standard deviation (a measure of dispersion around the mean, which you will learn about in Chapter 10) is 0.28. We might take these values to be the mean and standard deviation of the population of logged RTs. But this would not be correct: these are the sample mean and standard deviation. To show why we cannot take these to be the mean and standard deviation of the population, we can simulate RT data based on those values (you will understand the details later on, when you learn about probability distributions in Chapter 18, so for now just come along for the ride).
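Before we simulate, a quick sanity check: the sample values quoted above can be computed directly from the `mald` data we read in earlier.

```r
# Sample mean and SD of the logged RTs (the values quoted above).
round(mean(mald$RT_log), 2)
round(sd(mald$RT_log), 2)
```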
```r
set.seed(9899)

# Draw 10 random samples of 20 simulated RTs each,
# based on the sample mean and SD of the logged RTs.
rt_l <- list()

for (i in 1:10) {
  rt_l[i] <- list(rlnorm(n = 20, mean(mald$RT_log), sd(mald$RT_log)))
}
```
We sample 20 randomly generated values from a distribution with mean 6.88 and standard deviation 0.28. We can then take the mean and SD of these generated values and compare them to the original distribution's mean and SD. But we can go further and sample 20 values 10 times. This procedure gives us 10 means and standard deviations, one for each sample of 20 values. The means and standard deviations of these 10 random samples are shown in the table below.
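One way (a sketch, not the only option) to compute these summaries from the `rt_l` list created above; note that `rlnorm()` generates values on the raw millisecond scale, so we take logs before summarising:

```r
# Mean and SD of each simulated sample, back on the logged scale.
tibble(
  sample = 1:10,
  mean = round(map_dbl(rt_l, \(x) mean(log(x))), 2),
  sd = round(map_dbl(rt_l, \(x) sd(log(x))), 2)
)
```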
| sample | mean | sd   |
|--------|------|------|
| 1      | 6.84 | 0.24 |
| 2      | 7.02 | 0.30 |
| 3      | 6.92 | 0.19 |
| 4      | 6.88 | 0.31 |
| 5      | 6.84 | 0.27 |
| 6      | 6.80 | 0.22 |
| 7      | 6.84 | 0.25 |
| 8      | 6.95 | 0.34 |
| 9      | 6.99 | 0.31 |
| 10     | 6.97 | 0.32 |
You will notice that, while all the means and SDs are very close to the mean and SD we sampled from, they are not exactly the same: each sample's mean and SD differ slightly from each other and from the mean and SD we sampled from. In other words, it is very, very unlikely that the sample mean and standard deviation are exactly the population mean and standard deviation. Inference is affected by uncertainty and variability. So what do we do with such uncertainty and variability? We can use statistics to quantify them!
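As a very first taste of such quantification, we can measure how much the ten sample means spread around the mean we sampled from; a minimal sketch, reusing the `rt_l` list from above:

```r
# Spread of the 10 sample means around the mean we sampled from (6.88).
sample_means <- map_dbl(rt_l, \(x) mean(log(x)))
round(sd(sample_means), 3)
```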
But what is statistics exactly?
Statistics is a tool. But what does it do? There are at least four ways of looking at statistics as a tool.
- “Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting and presenting empirical data.” (From the UCI Department of Statistics)
- “Statistics is the technology of extracting information, illumination and understanding from data, often in the face of uncertainty.” (From the British Academy)
- “Statistics is a mathematical and conceptual discipline that focuses on the relation between data and hypotheses.” (From the Stanford Encyclopedia of Philosophy)
- “Statistics is the art of applying the science of scientific methods.” (From ORI Results, Nature)
To quote a historically important statistician:
Statistics is many things, but it is also not a lot of things.
- Statistics is not maths, but it is informed by maths.
- Statistics is not about hard truths, nor about how to seek the truth.
- Statistics is not a purely objective endeavour. In fact, there are many subjective aspects to statistics (see below).
- Statistics is not a substitute for common sense and expert knowledge.
- Statistics is not just about \(p\)-values and significance testing.
As Shakespeare put it, all that glisters is not gold.
In Silberzahn et al. (2018), a group of researchers asked 29 independent analysis teams to answer the following question based on provided data: is there a link between player skin tone and number of red cards in soccer? Crucially, 69% of the teams reported an effect of player skin tone, and 31% did not. In total, the 29 teams came up with 21 unique types of statistical analysis. These results clearly show how subjective statistics is and how even a straightforward question can lead to a multitude of answers. To put it in Silberzahn et al.'s words: “The observed results from analyzing a complex data set can be highly contingent on justifiable, but subjective, analytic decisions.” This is why you should always be somewhat sceptical of the results of any single study: you never know what results might have been found if another research team had done the study. This is one reason why replicating research is very important. You will learn about replication and related concepts in Chapter 17.
Coretta et al. (2023) tried something similar, but in the context of the speech sciences: they asked 30 independent analysis teams (84 signed up, 46 submitted an analysis, 30 submitted usable analyses) to answer the question: do speakers acoustically modify utterances to signal atypical word combinations? Remarkably, the 30 teams submitted 109 individual analyses (a bit more than 3 analyses per team!), with 52 unique measurement specifications in 47 unique model specifications. Coretta et al. (2023) report: “Nine teams out of the thirty (30%) reported to have found at least one statistically reliable effect (based on the inferential criteria they specified). Of the 170 critical model coefficients, 37 were claimed to show a statistically reliable effect (21.8%).” Figure 6.2 illustrates the analytic flexibility typical of acoustic analyses. Panel (A) shows the pipeline of decisions a researcher has to make: which linguistic unit, which temporal window, which acoustic parameters, and how to measure them. You can appreciate that there are potentially many combinations. Panel (B) illustrates the fundamental frequency (f0) contour of the sentences “I can't bear ANOTHER meeting on Zoom” and “I can't bear another meeting on ZOOM”. In both sentences, the green shaded area marks the word “another”. Finally, in Panel (C) you see the different parameters that can be extracted from the f0 contour of the word “another”. In sum, researchers face many choices and, while most of these choices might be justifiable, they are still subjective, as shown by the large variability in the analyses actually carried out by the analysis teams in Coretta et al. (2023).
The Silberzahn et al. (2018) and Coretta et al. (2023) studies are just the tip of the iceberg. We are currently facing a “research crisis”. As mentioned above, we will dig deeper into this subject in Chapter 17. In brief, the research crisis is a mix of problems related to how research is conducted and published. In response to the research crisis, Cumming (2013) introduced a new approach to statistics, which he calls the “New Statistics”. The New Statistics mainly addresses three problems: (1) published research is a biased selection of all (existing and possible) research; (2) data analysis and reporting are often selective and biased; and (3) in many research fields, studies are rarely replicated, so false conclusions persist. To help solve these problems, the New Statistics proposes the following solutions (among others): (1) promoting research integrity, whereby researchers explicitly discuss the subjectivity and shortcomings of quantitative research; (2) shifting away from statistical significance of differences between groups to quantitative estimation of those differences; and (3) building a cumulative quantitative discipline, in which phenomena are studied again and again in the same contexts and with the same conditions to ensure the findings are robust.
Kruschke and Liddell (2018) revisit the New Statistics and make a further proposal: to adopt the historically older but only recently popularised approach of Bayesian statistics. They call this the Bayesian New Statistics. The classical approach to statistics is the frequentist method, based on work by Fisher, Neyman and Pearson. Put simply, frequentist statistics is based on rejecting the “null hypothesis” (i.e. the hypothesis that there is no difference between groups) using \(p\)-values. Bayesian statistics provides researchers with more appropriate and more robust ways to answer research questions, by reallocating belief (or credibility) across possibilities. You will learn more about the frequentist and the Bayesian approaches in Chapter 20.
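To get a very informal feel for what “reallocating belief or credibility across possibilities” means, here is a toy sketch in R (the hypotheses and numbers are made up purely for illustration): we start with equal belief in two hypotheses and let Bayes' rule shift belief towards the hypothesis that fits the data better.

```r
# Toy illustration: reallocating credibility across two hypotheses.
prior <- c(H1 = 0.5, H2 = 0.5)       # equal belief before seeing the data
likelihood <- c(H1 = 0.8, H2 = 0.2)  # how well each hypothesis fits the data
posterior <- prior * likelihood / sum(prior * likelihood)
posterior
#>  H1  H2
#> 0.8 0.2
```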
This textbook adopts the Bayesian New Statistics approach. Note that we will not really touch upon Bayesian statistics in the strict sense until Chapter 20, just before statistical modelling is introduced. So you should not worry too much about it for now: just try to appreciate that not only is statistics not intended to objectively separate truths from falsities, but there are also several ways to practise statistics. After all, statistics is a human activity, and like all other human activities it is embedded in the world constructed by humans and their idiosyncrasies.