1 Introduction

These instructions explain how to obtain the key descriptive statistics for acoustic measurements using R, both as numerical values and as various types of plots.

Descriptive statistics summarise what is going on in the data. If the descriptive statistics are done well in an investigation and the subsequent written report, then it is clear what is going on. As Russ Lenth, a statistician at the University of Iowa and author of the excellent R package emmeans puts it,

Failure to describe what is actually going on in the data is a failure to do an adequate analysis. Use lots of plots, and think.

1.1 R and RStudio

R is a programming language used for statistical computing. It is based on the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues in the 1970s.

R is a programming language in the sense that, instead of using the mouse to select commands from menus, you enter commands either interactively (via a command-line interface) or by executing a script, much as you have been doing with Praat. Like Praat scripting, R takes some getting used to, but for many basic tasks, once you have a template it is straightforward to make small alternations for different data frames or types of plots.

RStudio is Integrated Development Environment, or IDE, for R. RStudio allows users to develop and edit programs in R by supporting a large number of statistical packages, higher quality graphics, and the ability to manage your files and more. For example, you can easily have multiple R sessions open at once in a single instance of RStudio.

Note that RStudio cannot function without R. Think of R as the engine and RStudio as the console of a car.

There is a huge amount of information about R and RStudio on the internet, and countless tutorials and introductions you can find via Google or other search engines.

In this document, we will focus on just a few tasks:

  • Importing data into R.
  • Generating numerical summaries.
  • Making and saving plots.

2 Install R and RStudio

Before you start, you will have to have installed both R and RStudio. Make sure you followed the instructions on Learn.

3 Getting data into R

R comes with several functions to import, change and plot data. They all do a very good job.

Recently, though, a set of add-on packages have been developed to create a more consistent user experience.

This set of packages is called the tidyverse.

When you install R, you only get “base” R, so the tidyverse packages are not installed.

To install them all you can simply run the following:

install.packages("tidyverse")

Now the tidyverse packages are installed in the R library. The library is the place (generally one folder) where the packages files are installed to.

However, to use packages that are not part of base R, you will have to attach them every time you start an R session in RStudio (i.e. every time you launch RStudio).

You attach packages with the library() function, like so:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

3.1 Data file format

First, have a look at what the cherokeeVOT.txt file looks like. You can open it in RStudio by clicking on the file name in the Files panel.1

The first few lines will look like this:

VOT year    Consonant
67  1971    k
127 1971    k
79  1971    k
150 1971    k
53  1971    k
65  1971    k
75  1971    k
109 1971    k
109 1971    t
126 1971    t

This file lists the mean VOTs for both /t/ and /k/ for a number of Cherokee speakers. Each line represents the mean VOT for a single speaker and phone, along with the year the measurement was made.

Note that the first line of this file, called the header lists the names of the columns (or variables). These names will be imported into R when we import the file. If your data file doesn’t have a header, it’s possible to add it from within R, but it’s usually simpler to make sure that your data file has a header before you import it.

In this example, the different columns are separated by a “tab” character. Characters that separate columns are called delimiters. Files with tabular data separated by tabs are called tab-separated values files, and have either a .txt extension or .tsv.

Another delimiter is the comma ,, in which case the file might have the extension .csv (for comma-separated values). R doesn’t really care what delimiter you use, so long as you use it consistently within one file.

3.2 Importing data into R

There are a couple of ways to import the data from the file into R. Importing data means making R being able to read and manipulate the data. Think of this as loading the data in the memory of R.

One way to import data is to select File > Import dataset > From Text (readr...) from the menu. This will bring up a file browser and you can navigate to wherever on your computer you saved the file. You will then see an Import Dataset dialog with a bunch of options. If you just click Import, a new RStudio tab will open showing the data:

You will also see some lines of text have appeared in the Console window:

> cherokeeVOT <- read_delim("data/cherokeeVOT.txt")
> View(cherokeeVOT)

These two lines of code do two things:

  1. Read the cherkokeeVOT.txt file, and store it as a data frame (more specifically a tibble, but never mind for now) named cherokeeVOT.
  2. Open a tab in RStudio allowing you to inspect the data.

Using the menu command saved us some typing, but helpfully, RStudio shows you the code you would have needed to type to achieve exactly the same result. In this, it’s a bit like the Paste history command in Praat, except a little more user-friendly.

Exactly what appears in the read_delim() command will depend on where your file is and your operating system. But what you can see is that, much like in Praat, the menu item can be translated into a text command.

This is very useful because you can save these commands into a script, and then run the script directly, without having to type all the commands out over and over again. Unlike Praat, however, the RStudio console is interactive, meaning you can try commands out in real time and immediately see the results. For instance, we could read the cherokeeVOT.txt file in ourselves, and maybe give it a different name:

vot <- read_delim("data/cherokeeVOT.txt")
## Rows: 44 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): Consonant
## dbl (2): VOT, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If we want to look at the contents of the file directly in the Console, we can just type its name:

vot
## # A tibble: 44 × 3
##      VOT  year Consonant
##    <dbl> <dbl> <chr>    
##  1    67  1971 k        
##  2   127  1971 k        
##  3    79  1971 k        
##  4   150  1971 k        
##  5    53  1971 k        
##  6    65  1971 k        
##  7    75  1971 k        
##  8   109  1971 k        
##  9   109  1971 t        
## 10   126  1971 t        
## # … with 34 more rows

This data table is relatively short (only 44 entries), but for larger data tables, viewing the data in this way is impractical. We can either use the View() command (note: capitalization is important here), which will open the data in a scrollable tab, or head(), which just shows the first few lines:

head(vot)
## # A tibble: 6 × 3
##     VOT  year Consonant
##   <dbl> <dbl> <chr>    
## 1    67  1971 k        
## 2   127  1971 k        
## 3    79  1971 k        
## 4   150  1971 k        
## 5    53  1971 k        
## 6    65  1971 k

We can add a number after the name of the data object to specific the number of lines:

head(vot, 10)
## # A tibble: 10 × 3
##      VOT  year Consonant
##    <dbl> <dbl> <chr>    
##  1    67  1971 k        
##  2   127  1971 k        
##  3    79  1971 k        
##  4   150  1971 k        
##  5    53  1971 k        
##  6    65  1971 k        
##  7    75  1971 k        
##  8   109  1971 k        
##  9   109  1971 t        
## 10   126  1971 t

4 Numerical summaries

The most common acoustic measurements (e.g. VOT durations, F0 at vowel midpoint, etc.) are numeric variables that can only take on positive numbers and they generally cannot be 0. If a vowel has a duration of 0 ms, then there is no vowel.

The descriptive statistics that are most appropriate in relation to such variables are the mean and the median (two measures of central tendency) and the standard deviation (a measure of dispersion around the mean).

There are a number of ways you can calculate these in R.

The first is the summary function:

summary(vot)
##       VOT              year       Consonant        
##  Min.   : 45.00   Min.   :1971   Length:44         
##  1st Qu.: 71.50   1st Qu.:1971   Class :character  
##  Median : 81.50   Median :2001   Mode  :character  
##  Mean   : 96.45   Mean   :1989                     
##  3rd Qu.:120.25   3rd Qu.:2001                     
##  Max.   :193.00   Max.   :2001

Here we see both the median and the mean, in addition to the highest (Max.) and lowest (Min.) values, and the 1st and 3rd quartiles, which are the middle numbers between the median and the minimum or maximum, respectively.

Another way to get these values is to ask for them explicitly, using the functions mean(), median() and sd() (for standard deviation):

mean(vot$VOT)
## [1] 96.45455
median(vot$VOT)
## [1] 81.5
sd(vot$VOT)
## [1] 38.3839

A function normally takes one or more arguments (the things you put between the parentheses).

The argument in the three functions above has three parts. First is the name of the data object we’re interested in (in this case, the data frame vot). The dollar sign $ is a way of saying “just return a single column of the named data frame”. Then, we put the name of that column to the right of the $. So here, we’re asking for the VOT column of the vot data frame. Maybe cherokeeVOT was a better name for the data frame after all. But you can see that it doesn’t matter: the column and the data frame object could have the exact same name, and it wouldn’t confuse R, since one refers to the data frame itself, and the other to a particular column of that data frame.

For some columns, it doesn’t make any sense to ask for a quantity like the mean, and R will quite reasonably complain:

mean(vot$Consonant)
## Warning in mean.default(vot$Consonant): argument is not numeric or logical:
## returning NA
## [1] NA

To get more information about the arguments that a function takes, you can always use the built-in manual pages by prefacing the function with a question mark ?:

?mean

5 Group data with group_by()

OK, so we can get descriptive statistics for the entire data frame. But what if we want to know, say, the mean and standard deviation of VOT by year?

Here the function group_by() comes in handy. group_by() allows us to group data based on some columns. We can use the grouping to calculate summary statistics for each group.

For example, to calculate the mean VOT values for each year in our data:

group_by(
  .data = vot,
  year
) %>%               # this is called a "pipe", see below
  summarise(
    # COLUMN_NAME = FUNCTION
    vot_mean = mean(VOT),
    vot_sd = sd(VOT)
  )
## # A tibble: 2 × 3
##    year vot_mean vot_sd
##   <dbl>    <dbl>  <dbl>
## 1  1971    114.    35.9
## 2  2001     84.7   36.1

Let’s unpack that.

The group_by() function takes some data (here vot) and, well, groups them by year. Note that this is done “under the hood” so you don’t really see the effects of grouping until you do something with the grouped data.

What’s that weird thing now, %>%?

That is called a pipe. A pipe is a symbol that tells R to pass whatever the function before it spits out onto the function that follows it, so you don’t have to specify the data in the following function. It works like a pipe, a tube that brings water from a tank to your sink, for example.

Think of the pipe as a “teleporter”: it just teleports the output of one function onto the next.

Then we use the summarise() function to get numerical summaries of the grouped data frame. This function requires you to specify the name of the new columns that will hold the summary results and the functions to calculate the summaries.

Note that now you can specify an existing column name directly, without having to use the $ syntax as be did at the beginning.

Here’s a walk-through of the code:

summarise(
  # Create a new column called `vot_mean`
  # and calculate the mean of the VOT column
  vot_mean = mean(VOT),
  # Create a new column called `vot_sd`
  # and calculate the standard deviation of the VOT column
  vot_sd = sd(VOT)
)

The output of the code now has a vot_mean and a vot_sd column, with the VOT mean and standard deviation for the year 1971 and 2001 in separate rows.

You get means and standards deviations for the two years because we grouped the data by year with group_by().

Exercise. Try the summarise() function with the original non-grouped data.

summarise(
  .data = vot,
  # Add here below the necessary code to calculate the mean and SD
  
  
)

If you want to group by multiple columns, not just one, you simply add all the column names you want, separated by a comma ,, in group_by().

group_by(
  .data = vot,
  year,
  Consonant
) %>%
  summarise(
    vot_mean = mean(VOT),
    vot_sd = sd(VOT),
    obs_n = n() # number of observations per group
  )
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## # A tibble: 4 × 5
## # Groups:   year [2]
##    year Consonant vot_mean vot_sd obs_n
##   <dbl> <chr>        <dbl>  <dbl> <int>
## 1  1971 k             90.6   34.2     8
## 2  1971 t            132.    26.3    10
## 3  2001 k             98.1   45.0    13
## 4  2001 t             71.2   17.3    13

Tip: When you start collecting your R commands into scripts, it’s usually best to put all the library() calls at the very beginning of the script, so that all the functions in those package libraries are available for the rest of the script.

Now that the package is loaded, you can use all of the functions it provides.

6 Graphical summaries

There are different ways to generate plots in R, using both the base (built-in) graphics as well as packages such as lattice or ggplot2.

In this tutorial, we’ll make use of the ggplot2 package. Once again, there are many, many online resources related to ggplot2; ggplot2.tidyverse.org is a good place to start.

6.1 Strip charts

Strip charts are a basic type of plot that works very well when you have a mix of numeric and categorical variables, as we do here (VOT is numeric and Consonant is categorical).

Let’s start off with a minimal strip chart.

As a first simple example, let’s just plot the VOT means and standard deviations for /t/ and /k/, ignoring the year.

ggplot(
  data = vot,
  mapping = aes(x = Consonant, y = VOT)
) +
  geom_jitter(width = 0.1) # try width values 0.2, 0.5, 0.8 to see what the argument does

Awesome!

There’s a lot going on in here, though. Let’s unpack the syntax of this function call a little bit.

The heart of the code is the ggplot() function itself (note that there’s no 2 in the function name, as opposed to the name of the package ggplot2).

A call to ggplot takes at least two arguments. The first, data, is the data frame containing the columns of information we’re going to use. The second, mapping, is used to “map” data to so-called aesthetics.

Aesthetics are visual aspects of the graph, like axes, colour, and labels. You specify aesthetics using the aes() function.

The most simple aesthetics are the x and y axes (not that some plots take only x and some other also have z). In the plot above we are mapping the x-axis to Consonant and the y-axis to VOT.

Once you have specified the data and the data-to-aesthetics mapping, you can add geometries.

Geometries are the shapes, text, labels, lines, etc, that represent the individual data points in the plot. The strip chart plot uses the “jitter” geometry. The function for the jitter geometry is simply geom_jitter() (geom is short for geometry).

Most times it’s sufficient to include a geom_*() function without arguments (within the parentheses). Here, we specified the argument width.

Exercise: Try different values for the argument width to see what it does.

Note that all the component functions in a ggplot are concatenated together with a plus + sign.2

It’s OK for now if you don’t fully understand all the components of the syntax. The important thing is to see where the different parts of the data appear. You can then copy and paste this basic template and substitute the relevant columns from your own data until you get comfortable with the syntax.

6.2 Faceting

To visualise subsets of data, you can use faceting.

Faceting creates separate panels (of facets) based on a specified column. Here, we can facet the VOT data by Year.

ggplot(
  data = vot,
  mapping = aes(x = Consonant, y = VOT)
) +
  geom_jitter(width = 0.1) +
  facet_grid(~ year)

Do you notice something interesting with the data now that we have faceted by year?

Exercise. Fill in the code below to have year on the x-axis and faceting by Consonant.

ggplot(
  data = vot,
  mapping = aes(x = ..., y = VOT)
) +
  geom_jitter(width = 0.1) +
  facet_grid(~ ...)

6.3 Histograms and density plots

For many acoustic measurements, such as VOT, it is often very useful to visualize the distribution of the data in the form of a histogram or a density plot.

In a histogram, the values of interest are grouped into different ranges, or bins, and the height of each bar indicates the number of data points that fall into that bin.3

The number of bins is what’s called a free parameter. The user has to select a value for it, and the resulting visualization depends to some extent on the value chosen. ggplot() will select a default value, but this isn’t necessarily the appropriate one for the chosen data frame (though the defaults are often pretty sensible).

A density plot is similar to a histogram, but it uses a mathematical trick to interpolate values over a continuous range, rather than using bins. One advantage of a density plot is that its shape isn’t affected by the number of bins chosen (since there are no bins).

Both histograms and density plots are easy to generate with ggplot(). You just need to use the relative geometries: geom_histogram() and geom_density().

ggplot(
  vot,
  aes(x = VOT)
) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(
  vot,
  aes(x = VOT)
) +
  geom_density() +
  geom_rug()        # adds ticks for each observation

Both calls have a very similar form, and differ in similar ways from the previous function calls.

In the mapping specification with aes(), we only need to specify an x-axis variable, in this case VOT.

In these basic plots, we don’t have to supply any arguments to the geometries we use (geom_histogram() or geom_density(), respectively).

However, when we make a histogram, we get a message suggesting that we change the value of binwidth. You can do this by specifying the binwidth argument inside of geom_histogram(). Try a few different values and see what happens. What do you think is a sensible value for this data frame?

When making a density plot that isn’t faceted, it can sometimes be hard to see all the different plots if they are stacked on top of one another. You can make them semi-transparent by including an alpha = argument to geom_density. Try alpha = 0.5 to get an idea of this behavior.

It’s not clear if the density plots are really appropriate for this data frame, because there are so few values to begin with. But there is something very obvious in both plots that is less obvious in the boxplots: for the year 2001 data, there are several tokens of /k/ that have a much, much longer VOT than the average. We might want to look more closely at those tokens, and/or consider excluding them from the summary if we have good reason to believe they are atypical or anomalous (because their inclusion increases the mean of this category).

Note that it is NOT acceptable to throw away outliers just for being outliers! There could well be some good reason why these tokens have a longer VOT, and if so, you want to know what that is. The point here is simply that not all relevant aspects of the data are necessarily revealed in a single graphical visualization (or descriptive statistic, for that matter). Once again: use lots of plots, and think.

7 Saving plots

You’ve spent hours (OK, hopefully not hours!) creating the perfect plot. How can you save it to put in your write-up or report? There are a couple of options:

  1. Use one of the options under the Export tab in the RStudio Plots window:

Here you can save either as a PDF or as PNG, JPG, etc., or you can copy the plot to the Clipboard. Very flexible.

  1. If you want the plots to be automatically saved to disk as part of a script, you can use ggsave(). See the help page ?ggsave for more information.

Minimally, you need to specify a file path, i.e. the location where you want the plot to be saved and the name of the file.

By default, ggsave() will save the last generated plot.

ggplot(
  vot,
  aes(x = VOT)
) +
  geom_histogram()

ggsave("img/vot-hist.png")

But you can specify a plot if you assign the plot into a variable:

dens <- ggplot(
  vot,
  aes(x = VOT)
) +
  geom_density()

hist <- ggplot(
  vot,
  aes(x = VOT)
) +
  geom_histogram()

ggsave("img/vot-dens.png", plot = dens)
## Saving 7 x 5 in image

8 Other things you can do

This tutorial only scratches the surface of what you can do with R and ggplot2. There are many online tutorials and resources that you can use to learn how to modify and customize your plots further (the Cookbook for R is often a good place to start). Here I’ll list just a few very commonly requested modifications:

8.1 How do I change the plot background?

Try adding + theme_minimal() or + theme_dark() to the end of your function call. Look at e.g. ?theme_minimal for a complete list of themes.

8.2 How do I change the axis labels?

There are two ways to do this. The first way is to rename the columns themselves. But we won’t use that here.

The second way is to specify the labels with the labs() function:

ggplot(
  vot,
  aes(x = Consonant, y = VOT)
) +
  geom_jitter(width = 0.1) +
  facet_grid(~ year) +
  labs(
    x = "Consonant type",
    y = "VOT (ms)",
    # You can also add title, subtitle, and caption
    title = "VOT by year"
  )

8.3 How do I change the order in which factor levels appear?

R defaults to alphabetic/numeric order, but oftentimes we might want to change this (for instance, it might make more linguistic ‘sense’ for the consonants to be ordered by place of articulation from most to least proximal, i.e. labial - coronal - dorsal).

To do this, we have to explicitly change the order in the underlying data frame.

We can do it with the mutate() function.

mutate(
  # Specify the data frame to mutate
  vot,
  # Specify the new column name and what to make it equal to
  Consonant = factor(Consonant, levels = c("t", "k"))
) %>%
ggplot(aes(x = Consonant, y = VOT)) +
  geom_jitter(width = 0.1) +
  facet_grid(~ year) +
  labs(
    x = "Consonant type",
    y = "VOT (ms)",
    # You can also add title, subtitle, and caption
    title = "VOT by year"
  )

8.4 How do I transform data to z-scores (standardisation)?

To deal with idiosyncratic differences in average formant values across speakers, you can z-transform your data.

Z-transformation or Lobanov transformation, most commonly known outside phonetics as standardisation, can be easily done in R.

alb_vow <- readRDS("data/formants.rds")

Let’s first see what the untransformed data looks like.

centroids <- aggregate(
  cbind(F2, F1) ~ vowel,
  alb_vow,
  mean
)

alb_vow %>%
  ggplot(aes(x = F2, y = F1, group = vowel)) +
  scale_x_reverse(name = "F2 Hz", position = "top") +
  scale_y_reverse(name = "F1 Hz", position = "right") +
  geom_point(aes(colour = vowel), alpha = 0.5) +
  stat_ellipse(aes(colour = vowel), type = "norm", size = 1) +
  geom_text(data = centroids, aes(F2, F1, label = vowel), size = 7) +
  theme(legend.position = "none") +
  annotate("polygon", x = c(1200, 400, 400, 1200), y = c(1200, 400, 1200, 1200), alpha = 0.5)

To z-transform your formant data, you need to group the data by speaker and then calculate z-scores for each formant separately.

Z-scores are calculated with the simple formula (F - mean(F)) / sd(F), where F is a vector or column with formant measurements.

alb_vow <- alb_vow %>%
  group_by(speaker) %>%
  mutate(
    F1_z = (F1 - mean(F1)) / sd(F1),
    F2_z = (F2 - mean(F2)) / sd(F2)
  )

Now let’s plot the transformed data.

centroids_z <- aggregate(cbind(F2_z,F1_z)~vowel, alb_vow, mean)

alb_vow %>%
  ggplot(aes(x = F2_z, y = F1_z, group = vowel)) +
  scale_x_reverse(name = "F2 (z-scores)", position = "top") +
  scale_y_reverse(name = "F1 (z-scores)", position = "right") +
  geom_point(aes(colour = vowel), alpha = 0.5) +
  stat_ellipse(aes(colour = vowel), type = "norm", size = 1) +
  geom_text(data = centroids_z, aes(F2_z, F1_z, label = vowel), size = 7) +
  theme(legend.position = "none")


  1. This file is a plain text file, so you can open it in an application like Notepad (on a Windows PC) or TextEdit (on a Mac).↩︎

  2. You don’t have to have a carriage return between the different components of the function call. R doesn’t care if the entire thing is written on one line. But since RStudio understands the ggplot2 syntax, it auto-indents for you in a way that can make the call much more readable for humans, which can help to catch errors and also to make changes later.↩︎

  3. Do not mix up histograms and bar plots. They are different!↩︎