This post is an overview of how factors (i.e. categorical variables) are coded under the hood and which types of coding can be set in R.^{1}

# Introduction

There’s seems to be a bit of terminological mix-up in the wild, so we first present a terminological set that will be used throughout the vignette.

Categorical variables in R are generally stored using factors.
A **factor** is a vector of values from a categorical variable.
The possible values in a factor are called **levels** in R.

For each observation in the factor, the vector specifies the level of that observation.

For example, let’s assume we have a data set with information on dinosaurs and one column specifies the dinosaur’s diet: `carnivore`

, `herbivore`

, `omnivore`

.

In R, this column can be coded as a factor:

`factor(c("carnivore", "carnivore", "herbivore", "omnivore", "herbivore"))`

```
## [1] carnivore carnivore herbivore omnivore herbivore
## Levels: carnivore herbivore omnivore
```

We are so accustomed to using factors in regression models that sometimes we forget that regressions only work with numbers and cannot work with categorical variables.

To fit a regression model with categorical variables, these are first converted to numbers.
The process of conversion is called **coding**.

One type of coding is the **dummy variable coding** or simply **dummy coding**.
This consists of assigning `0`

s or `1`

s to the levels in the variable.

Let’s go through a simple example of dummy coding of a categorical variable with only 2 levels: `metropolitan`

and `rural`

.

The most simple way of coding this categorical variable as a number is to assign `0`

to one level and `1`

to the other level. For example:

`factor(c("rural", "rural", "metropolitan", "rural"))`

```
## [1] rural rural metropolitan rural
## Levels: metropolitan rural
```

```
# dummy coded
c(0, 0, 1, 0)
```

`## [1] 0 0 1 0`

In R, dummy coding is done under the hood for you when using factors, so you don’t have to worry about the conversion.

When the categorical variable has 3 levels instead of 2, we need a work-around in order to code the 3-level factor with only `0`

s and `1`

s (we can’t use higher numbers for reasons we will see later).

With three levels, we can code the variable using two numeric variables (instead of just one). Going back to the dinosaur’s diet example, we can use:

- One variable that codes whether the dinosaur is a carnivore
`0`

or a herbivore`1`

. - One variable that codes whether the dinosaur is a carnivore
`0`

or an omnivore`1`

.

Let call the first variable `dummy_1`

and the second variable `dummy_2`

. Then:

- When
`dummy_1`

is`0`

and`dummy_2`

is also`0`

, the dinosaur is a carnivore. - When
`dummy_1`

is`1`

and`dummy_2`

is`0`

, the dinosaur is a herbivore. - When
`dummy_1`

is`0`

and`dummy_2`

is`1`

, the dinosaur is an omnivore.

So the following factor (repeated from above):

`factor(c("carnivore", "carnivore", "herbivore", "omnivore", "herbivore"))`

```
## [1] carnivore carnivore herbivore omnivore herbivore
## Levels: carnivore herbivore omnivore
```

can be coded as:

```
dummy_1 <- c(0, 0, 1, 0, 1) # carnivore (0) or herbivore (1)?
dummy_2 <- c(0, 0, 0, 1, 0) # carnivore (0) or omnivore (1)?
library(tidyverse)
tibble(
dummy_1, dummy_2
)
```

```
## # A tibble: 5 × 2
## dummy_1 dummy_2
## <dbl> <dbl>
## 1 0 0
## 2 0 0
## 3 1 0
## 4 0 1
## 5 1 0
```

If this doesn’t make much sense, try and figure it out by checking the value of the two columns for each row with the following code:

```
# case_when() is a very helpful function from dplyr!
case_when(
dummy_1 == 0 & dummy_2 == 0 ~ "carnivore",
dummy_1 == 1 & dummy_2 == 0 ~ "herbivore",
dummy_1 == 0 & dummy_2 == 1 ~ "omnivore",
)
```

What if the factor has 4 levels? Then you can code it with 3 dummy variables. And what about 5 levels? Use 4 dummy variables. The number of dummy variables needed is equal to the number of levels minus 1 (\(n_{dummy} = n_{levels} - 1\)).

## Summing up

To sum up:

**Factors**are vectors that code categorical variables.- The values in a factor are called
**levels**. - Regression models cannot work directly with factors, so these are coded using numbers.
**Dummy coding**is one way of coding factors as numbers using one or more numeric variables of`0`

s and`1`

s.

# Coding and contrasts

Now. We’ve seen that dummy coding is simply using dummy numeric variables with `0`

s and `1`

s.

In fact, this is **one way** of coding factors, or one coding scheme.^{2}
Different coding schemes in R are called **contrasts**.
Dummy coding is called **treatment contrasts** in R.

## Treatment contrasts

The term **treatment contrasts** comes from the clinical sciences where you test, for example, the efficacy of a medical intervention (a drug, surgery, etc…) by comparing a **control group** (which has not received the “treatment”) with a group that has received the medical intervention (the **treatment group**).

The control group can be used as the reference group to see if the treatment group has benefited from the medical treatment (i.e. if the treatment group’s health has improved after the intervention relative to the control group, then one can infer that the treatment was effective).

Let’s look at treatment contrasts in R.

In the previous section, we’ve been illustrating dummy coding by assigning `0`

s and `1`

s using one or more dummy variables.
In practice, you do not need to do that to run real analyses, because R does that under the hood for you.

Factors in R are coded with treatment contrasts by default.
Also by default, the first level is set as the reference level (the order is alphabetical by default).
The reference level is the level that gets coded only with `0`

s, as we have seen above for the dinosaur’s diet factor (`carnivorous`

had `dummy_1 = 0`

and `dummy_2 = 0`

).

Let’s see an example using a data table with measurements of vowel duration.

```
load("./static/data/vowels.rda")
glimpse(vowels)
```

```
## Rows: 886
## Columns: 9
## $ item <dbl> 20, 2, 11, 1, 15, 10, 13, 3, 14, 19, 4, 6, 16, 17, 5, 23…
## $ speaker <chr> "it01", "it01", "it01", "it01", "it01", "it01", "it01", …
## $ word <chr> "pugu", "pada", "poco", "pata", "boco", "podo", "boto", …
## $ v1_duration <dbl> 95.23720, 138.96844, 126.93226, 127.49888, 132.33310, 12…
## $ c2_voicing <chr> "voiced", "voiced", "voiceless", "voiceless", "voiceless…
## $ vowel <chr> "u", "a", "o", "a", "o", "o", "o", "a", "o", "u", "a", "…
## $ c2_place <chr> "velar", "coronal", "velar", "coronal", "velar", "corona…
## $ speech_rate <dbl> 4.893206, 5.015636, 4.819541, 5.031662, 5.063435, 5.0632…
## $ speech_rate_c <dbl> -0.55937531, -0.43694485, -0.63303978, -0.42091937, -0.3…
```

For example, let’s take the `vowel`

column.
This column indicates which vowel the measurement was taken from, and that can be /a/, /o/, or /u/.

If we convert the `vowel`

column into a factor, the levels will be `a`

, `o`

and `u`

, and `a`

will be the reference level.

```
vowels <- vowels %>%
mutate(vowel = as.factor(vowel))
levels(vowels$vowel)
```

`## [1] "a" "o" "u"`

To get a sense of how a factor would be coded with treatment contrasts, we can print a dummy coding table with the `contr.treatment()`

function.

`contr.treatment(levels(vowels$vowel))`

```
## o u
## a 0 0
## o 1 0
## u 0 1
```

Now, let’s run a regression model with `v1_duration`

as the outcome variable and `vowel`

as the predictor.

```
vow_lm <- lm(v1_duration ~ vowel, data = vowels)
summary(vow_lm)
```

```
##
## Call:
## lm(formula = v1_duration ~ vowel, data = vowels)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.874 -22.567 -2.293 17.025 106.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.616 1.806 71.227 <2e-16 ***
## vowelo -5.641 2.549 -2.213 0.0272 *
## vowelu -29.763 2.603 -11.432 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.38 on 883 degrees of freedom
## Multiple R-squared: 0.1419, Adjusted R-squared: 0.1399
## F-statistic: 72.99 on 2 and 883 DF, p-value: < 2.2e-16
```

The summary returns three coefficients:

`Intercept`

.`vowelo`

.`vowelu`

.

Since `a`

is the reference level of `vowel`

, the `Intercept`

corresponds to the mean duration of the vowel `a`

, i.e. 128 ms.

The coefficient of `o`

is the **difference between the mean duration of o and the mean duration of the reference level a** (i.e. the

`Intercept`

).
So `o`

is 5.6 ms shorter than `a`

on average (shorter because the coefficient is negative).Finally, the coefficient of `u`

is the **difference between the mean duration of u and the mean duration of the reference level a**
So

`u`

is 29.7 ms shorter than `a`

.This is how treatment contrasts work.

## Sum contrasts

Another type of coding is **effect coding**.
In R, the corresponding contrast type are the so-called **sum contrasts**.

When using sum contrasts, the levels in a factor are coded using `1`

s, `-1`

s and `0`

s.
If you sum the values of each dummy variable you always get `0`

(hence the name “sum” contrast).

Let’s see what happens to the factor `vowel`

when using sum contrasts (remember that factors use treatment contrasts by default).

This is how sum coding would look like for this factor:

`contr.sum(levels(vowels$vowel))`

```
## [,1] [,2]
## a 1 0
## o 0 1
## u -1 -1
```

Since there are 3 levels, we need two dummy variables.
So `a`

is coded as `1, 0`

, `o`

is coded as `0, 1`

, and `u`

as `-1, -1`

.

To set the contrasts of a factor to sum coding, we can run the following:

```
contrasts(vowels$vowel) <- "contr.sum"
# If you want to change it back to treatment contrasts you can run
# contrasts(vowels$vowel) <- "contr.treatment"
```

With sum contrasts the reference level is in fact the grand mean.

In our model of vowel duration this means that the `Intercept`

coefficient will be the grand mean of vowel duration.

Let’s rerun the model and look at the output.

```
vow_lm <- lm(v1_duration ~ vowel, data = vowels)
summary(vow_lm)
```

```
##
## Call:
## lm(formula = v1_duration ~ vowel, data = vowels)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.874 -22.567 -2.293 17.025 106.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.815 1.055 110.728 < 2e-16 ***
## vowel1 11.801 1.483 7.957 5.40e-15 ***
## vowel2 6.160 1.481 4.160 3.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.38 on 883 degrees of freedom
## Multiple R-squared: 0.1419, Adjusted R-squared: 0.1399
## F-statistic: 72.99 on 2 and 883 DF, p-value: < 2.2e-16
```

The `Intercept`

now is 116 ms, which mean that the mean of vowel duration across the three vowels is 116 ms.

We can check this by taking the mean:

`mean(vowels$v1_duration)`

`## [1] 117.2747`

Yup, pretty close (small differences are fine).

So what are now the coefficients called `vowel1`

and `vowel2`

?

These are, respectively, the difference between the mean duration of `a`

and the grand mean, and the difference between the mean duration of `o`

and the grand mean.

So `a`

is 11.8 ms longer than the grand mean, and `o`

is 6.1 ms longer than the grand mean.

What about `u`

then?

Easy.
You just subtract the coefficients of both `a`

and `o`

from the grand mean: \(116.8 - 11.8 - 6.1 = 98.9\).

If you want to check that this is correct, the mean duration of `u`

according to the model above where we used treatment contrasts was \(128.616 - 29.763 = 98.853\).

## Sum contrasts and interactions

Sum contrasts can be very handy when the model contains interactions between factors.

Let’s say we want to include in our model of vowel duration a predictor that specifies the voicing of the stop following the vowel. We also add an interaction between vowel and voicing, so that we can model differences in the effect of voicing across vowels.

```
vow_lm_2 <- lm(v1_duration ~ c2_voicing + vowel + c2_voicing:vowel, data = vowels)
summary(vow_lm_2)
```

```
##
## Call:
## lm(formula = v1_duration ~ c2_voicing + vowel + c2_voicing:vowel,
## data = vowels)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.905 -23.262 -2.033 18.134 115.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 123.469 1.449 85.232 < 2e-16 ***
## c2_voicingvoiceless -13.309 2.049 -6.496 1.37e-10 ***
## vowel1 14.606 2.037 7.171 1.57e-12 ***
## vowel2 8.564 2.033 4.212 2.79e-05 ***
## c2_voicingvoiceless:vowel1 -5.609 2.880 -1.947 0.0518 .
## c2_voicingvoiceless:vowel2 -4.808 2.876 -1.672 0.0949 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.47 on 880 degrees of freedom
## Multiple R-squared: 0.1937, Adjusted R-squared: 0.1891
## F-statistic: 42.29 on 5 and 880 DF, p-value: < 2.2e-16
```

Now the `Intercept`

is the mean vowel duration when the following stop is voiced (the reference level of `c2_voicing`

).
This means that the average vowel followed by a voiced stop is 123 ms long in our data.

The coefficient of `c2_voicingvoiceless`

tells us the mean effect of `c2_voicing`

on vowel duration, averaged across all vowels.
So, on average, a vowel is about 13 ms shorter when followed by a voiceless stop.

The coefficients of `vowel1`

and `vowel2`

indicate the difference between the average vowel duration before a voiced stop (the `Intercept`

) and `a`

and `o`

respectively.
As before, to get the difference between the average vowel duration of `u`

before a voiceless stop and the mean vowel duration, you just need to subtract the coefficients of `vowel1`

and `vowel2`

from the `Intercept`

: \(123.5 - 14.6 - 8.5 = 100.4\).

The last two coefficients, `c2_voicingvoiceless:vowel1`

and `c2_voicingvoiceless:vowel2`

correspond to the difference in the effect of voicing between the average effect of voicing (`c2_voicingvoiceless`

, i.e. -13 ms) and the effect of voicing in `a`

and `o`

respectively.
That is, the decrease of vowel duration for `a`

is 5.6 ms greater than the average effect, while the decrease of vowel duration for `o`

is 4.8 ms greater than the average effect.

Following the usual formula, the effect of voicing for `u`

is \(-13.309 - (-5.6) - (-4.8) = -2.9\).

A previous version of this post is also featured as a vignette in the learnB4SS package, https://learnb4ss.github.io/learnB4SS/articles/contrasts.html↩︎

You can learn about more coding schemes here: https://stats.idre.ucla.edu/spss/webbooks/reg/chapter5/regression-with-spsschapter-5-additional-coding-systems-for-categorical-variables-in-regressionanalysis/.↩︎