8 Statistical variables

8.1 Estimandum, estimands and statistical variables

Statistical variables are a fundamental aspect of quantitative data analysis. There isn’t an agreed upon definition of a statistical variable, but generally speaking, anything that you have measured or counted is a statistical variable. For example, let’s say you want to measure language proficiency in L2 learners: “language proficiency” is your estimandum, i.e. the concept or entity you wish to measure; you decide to measure language proficiency using the score of a proficiency test, this is the estimand, i.e. the specific measurement of the estimandum “language proficiency”. When the estimand can take on different values, the estimand is a statistical variable: every participant will have a different proficiency score.

Estimandum, estimands and variable

An estimandum is any characteristic, phenomenon, entity, or concept that is the target of the measurement/counting process.

An estimand is the specific quantity of an estimandum that can be measured.

A (statistical) variable is any estimand or characteristics, number, or quantity that has been measured or counted and can vary.

Language research involves a large variety of statistical variables. Here just a few examples:

Token number of telic verbs and atelic verbs in a corpus of written Sanskrit.
Voice Onset Time of stops in Mapudungun.
Friendliness ratings of synthetic speech.
Accuracy of responses in a lexical decision task.
Digit memory span.
Phrasal headedness (head-initial vs head-final).

Try and think of more!

So a statistical variable is a measured characteristic. More specifically, a statistical variable is also a mathematical construct: the outcome of the specific mathematical process that generates the values that can be observed and measured. In the case of language proficiency, the statistical variable “proficiency test score” is generated by a process that includes a lot of factors (which can themselves be construed as statistical variables), like actual proficiency, stress levels when taking the test, baseline memory capacity, years of learning and so on. The generative process, i.e. the process that generates the values of a statistical variable, is ultimately what the researcher is interested in.

Generative process

The generative process of an estimand is the mathematical process that generates the values of the estimand that are observed or measured.

When you observe or measure something, i.e. when you collect a sample, you are taking note of the values of the statistical variable generated by the generative process. We call them statistical variables because each time you sample the variable, you get different values. In other words, the generative process allows for variation in the output values. The opposite of a variable is a called a statistical constant. Generative processes can contain both variables and constants. Statistical variables and constants are two types of estimands. In practice, you don’t have to worry about whether something is a variable or a constant and in most research contexts you will be working with statistical variables.

Quiz 1

True or false?

The estimandum refers to the specific measurable quantity of a concept or entity.
The generative process comprises only statistical variables.
A statistical variable is defined as any measurable or countable entity that can vary in value.

8.2 Types of variables

You will find that some statistics textbooks overcomplicate things when it come to types of statistical variables. From an applied statistics perspective, you only need to be able to identify numeric vs categorical variables and continuous vs discrete variables.

8.2.1 Numeric vs categorical variables

The distinction is quite self-explanatory:

Numeric variables are variables that are numbers.
Categorical variables are variables that correspond to categories, groups or levels on a scale.

Examples

Numeric variables

Number of multi-verb predicates in a book.
Duration of stressed vowels.
Rating score between 0-100.

Categorical variables

Gender (non-binary, female, male, …).
First vs second language users.
Ejective vs non-ejective consonant.

Learning how to recognise variables is a fundamental skill in quantitative data analysis, since the type of variables determines the type of analyses you can carry out.

8.2.2 Continuous vs discrete variables

Orthogonal to the numeric/categorical distinction, there is the continuous vs discrete distinction. This one can be at times less straightforward.

A continuous variable is a variable that can take on any value between any two numbers. For example, speech segment duration can be 0.2 s, 0.25 s, 0.2534 s and so on. Segment duration is continuous.
A discrete variable is a variable that can only take on a set of values, and no value in between. For example, number of gestures is discrete because you can measure 1, 2, 3, 10 gestures but not 3 gestures and three quarters.

Numeric variables can be either continuous or discrete, while categorical variables can only be discrete. There are also sub-types of numeric continuous, numeric discrete and categorical (discrete) variables. The following call-out introduces these sub-types, with examples.

Types of variables

Numeric continuous variable: between any two values there is an infinite number of values.

The variable can take on any positive and negative number, including 0. For example, temperature in degrees Celsius.
The variable can take on any positive number only. For example, segment duration, fundamental frequency (f0), reaction times.
Proportions and percentages: The variable can take on any number between 0 and 1. For example, proportion of accurate responses, probability of scalar inference, proportion of voicing during stop closure, acceptability rating on a 0-100 scale.

Numeric discrete variable: between any two consecutive values there are no other values.

Counts: The variable can take only on any positive integer number. For example, number of telic and atelic verbs in a corpus, number of words known by a child, number of turns in a conversation.

Categorical (discrete) variable. There are three main subtypes.

Binary or dichotomous: The variable can take only one of two values. For example, accuracy (incorrect, correct), voicing (voiceless, voiced), headedness (initial vs final).
The variable can take any of three of more values (sometimes called a multinomial variable). For example, gender (non-binary, female, male), place of articulation (labial, coronal, dorsal, glottal, …).
Ordinal: The variable can take any of three of more values and the values have a natural order. For example, Likert scales of attitude (positive, indifferent, negative), proficiency (functional, good, very good, native-like), lenition (stop, fricative, approximant, deletion).

8.3 Operationalisation

It should be clear now that the estimand is not quite the same thing as the estimandum. The estimand is the researcher’s way to capture the estimandum so that it can be analysed. The relationship between the estimandum and the estimand variable is called operationalisation: an estimandum is operationalised into an estimand. The action of operationalisation consists in choosing how to measure something: as a numeric or as a categorical variable. In some cases, the choice is obvious, but in most cases something could be operationalised either way and different considerations have to be taken into account when choosing, like the particular framework adopted and the study design.

Let’s think about “age” for a moment: age can be operationalised as years or months (numeric discrete) or as age bins, like young vs old (categorical). Different studies might require one or the other operationalisation of the estimandum “age”. Another example is “acceptability” in morphosyntactic studies: acceptability can be operationalised as a binary categorical variable (grammatical vs agrammatical, and we normally talk of “grammaticality”), as a categorical scale (acceptable, somewhat acceptable, somewhat not acceptable, unacceptable), or a numeric continuous scale (0 to 100). It is important, when planning a study, to carefully think about estimanda (the plural of estimandum) and estimands and how their relationship could be less clear than one might think.

Exercise 1

Think of all the ways to operationalise the following variables:

Voice Onset Time.
Friendliness of speech.
Lexical frequency.

Quiz 2

Which of the following sets contains only discrete variables.

Number of occurences in corpus, sentence duration (ms), articulation rate (syllables per second) Reaction times (ms), accuracy (correct/incorrect), 7-point likert scale Accuracy (correct/incorrect), 7-point likert scale, number of occurrences in corpus Fundamental frequency (hz), response accuracy (percentage), reaction times (ms)