11  Summarise data

11.1 Summarise with summarise()

Now that you have learned about summary measures, we can talk about how to summarise data in R, rather than just vectors as we did in the previous chapter. When you work with data, you will usually want summary measures for most of the variables, and data reports typically include them. It is also important to pick a summary measure that is appropriate for the type of variable at hand, which was covered in the previous section. In this chapter, you will learn how to obtain summary measures using the summarise() function from dplyr, a tidyverse package. Let’s practice with the data from Song et al. (2020) you read in Chapter 9. We want a measure of central tendency and a measure of dispersion for the reaction times in the RT column. To decide which measures to pick, think about the nature of the RT variable: reaction time is a numeric, continuous variable that can only take positive values, so the mean and the standard deviation are appropriate measures. Let’s start with the mean of the reaction time column RT. Go to your week-02.R script: if you followed Chapter 9 (and you should have), the script should already have the code to attach the tidyverse and read the song2020/shallow.csv file into a variable called shallow.
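
If those lines are missing from your script, they should look something like the following sketch (the data/ folder path is an assumption: adjust it to wherever you saved the file):

library(tidyverse)

# Read the Song et al. (2020) data; adjust the path to your own setup.
shallow <- read_csv("data/song2020/shallow.csv")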

Now let’s calculate the mean of RT with summarise(). The summarise() function takes at least two arguments: (1) the tibble to summarise and (2) one or more summary functions applied to columns of the tibble. In this case, we just want the mean RT. To get it, you write RT_mean = mean(RT), which tells the function to calculate the mean of the RT column and save the result in a new column called RT_mean. Yes, summarise() returns a tibble (a data frame)! This might seem overkill now, but you will see below that it is useful when you are grouping the data, so that, for example, you can get the mean of different groups in the data. Here is the code with its output:

summarise(shallow, RT_mean = mean(RT))
# A tibble: 1 × 1
  RT_mean
    <dbl>
1    867.

Great! The mean reaction time of the entire sample is 867.3592 ms. Sometimes you might want to round numbers, which you can do with the round() function. For example:

num <- 867.3592
round(num)
[1] 867
round(num, 1)
[1] 867.4
round(num, 2)
[1] 867.36

The second argument of the round() function sets the number of decimals to round to. By default it is 0, so the number is rounded to the nearest integer, that is, to the nearest whole number with no decimal values. Let’s recalculate the mean, this time rounding it.

summarise(shallow, RT_mean = round(mean(RT)))

What if we also want the standard deviation? Easy: we use the sd() function. Remember to round both the mean and the SD with round() when you write the code in your week-02.R script.

# get the rounded mean and SD
summarise(shallow, RT_mean = round(mean(RT)), RT_sd = round(sd(RT)))

Now we know that reaction times are on average 867 ms long and have a standard deviation of about 293 ms (rounded to the nearest integer). Let’s go all the way and also get the minimum and maximum RT values with the min() and max() functions (again, round all the summary measures).

Exercise 1

Complete this code to also get the minimum and maximum RT and round all measures to the nearest integer.

summarise(
  shallow,
  RT_mean = mean(RT), RT_sd = sd(RT),
  RT_min = ..., RT_max = ...
)

The functions for minimum and maximum are just a few lines above! Have you tried it yourself before seeing the solution?

Show me
summarise(
  shallow,
  RT_mean = round(mean(RT)), RT_sd = round(sd(RT)),
  RT_min = round(min(RT)), RT_max = round(max(RT))
)

Fab! When writing a data report, you could write something like this:

Reaction times are on average 867 ms long (SD = 293 ms), with values ranging from 0 to 1994 ms.

Remember that the standard deviation is a measure of how dispersed the data are around the mean: the higher the SD, the greater the dispersion around the mean, i.e. the greater the variability in the data. However, you can’t compare standard deviations across differently measured variables: for example, you can’t compare the standard deviation of reaction times with that of vowel formants, because the former is in milliseconds and the latter in Hertz, two different numeric scales. When the median is the more appropriate measure of central tendency, you can calculate it with the median() function instead of mean(). Go ahead and calculate the median reaction time in the data. Is it similar to the mean?

Exercise 2

Calculate the median of RTs in the shallow data.
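
If you want to check your code, here is one way of doing it (the column name RT_median is just a suggestion):

summarise(shallow, RT_median = median(RT))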

11.2 NA: Not Available

Most base R functions, like mean(), sd(), median() and so on, behave unexpectedly if the vector they are applied to contains NA values. NA is a special object in R that indicates that a value is Not Available, meaning that the observation does not have a value (or that the value was not recorded). For example, the following numeric vector contains five elements:

a <- c(3, 5, 3, NA, 4)

Four are numbers and one is NA. If you calculate the mean of a with mean(), something strange happens.

mean(a)
[1] NA

The function returns NA. This is because, by default, if even just one value in the vector is NA, then operations on the vector return NA.

mean(a)
[1] NA
sum(a)
[1] NA
sd(a)
[1] NA

If you want to discard the NA values when operating on a vector that contains them, you have to set the na.rm (for “NA remove”) argument to TRUE.

mean(a, na.rm = TRUE)
[1] 3.75
sum(a, na.rm = TRUE)
[1] 15
sd(a, na.rm = TRUE)
[1] 0.9574271

Quiz 1

  1. What does the na.rm argument of mean() do?
  2. What is the mean of c(4, 23, NA, 5) when na.rm is left at its default value?

Hint: check the documentation of mean() with ?mean.

11.3 Grouping data with group_by()

More often than not, you will want to calculate summary measures for specific subsets of the data. An elegant way of doing this is with the group_by() function from dplyr. This function takes a tibble and one or more column names, groups the rows based on those columns, and returns the grouped tibble.

shallow_g <- group_by(shallow, Group)

It looks as if nothing happened, but the rows of the shallow_g tibble are now grouped depending on the value of Group (L1 or L2). If you print the tibble (just write shallow_g in the Console and press Enter), you will notice that the second line of the output says Groups: Group [2], as in the output below. This line tells you how the tibble is grouped: here it is grouped by Group and there are two groups.

shallow_g
# A tibble: 6,500 × 11
# Groups:   Group [2]
   Group ID    List  Target        ACC    RT logRT Critical_Filler Word_Nonword
   <chr> <chr> <chr> <chr>       <dbl> <dbl> <dbl> <chr>           <chr>       
 1 L1    L1_01 A     banoshment      1   423  6.05 Filler          Nonword     
 2 L1    L1_01 A     unawareness     1   603  6.40 Critical        Word        
 3 L1    L1_01 A     unholiness      1   739  6.61 Critical        Word        
 4 L1    L1_01 A     bictimize       1   510  6.23 Filler          Nonword     
 5 L1    L1_01 A     unhappiness     1   370  5.91 Critical        Word        
 6 L1    L1_01 A     entertainer     1   689  6.54 Filler          Word        
 7 L1    L1_01 A     unsharpness     1   821  6.71 Critical        Word        
 8 L1    L1_01 A     fersistent      1   677  6.52 Filler          Nonword     
 9 L1    L1_01 A     specificity     0   798  6.68 Filler          Word        
10 L1    L1_01 A     termination     1   610  6.41 Filler          Word        
# ℹ 6,490 more rows
# ℹ 2 more variables: Relation_type <chr>, Branching <chr>

The grouping information is stored as an “attribute” of the tibble, named groups. You can check this attribute with attr(): you get a tibble with the groupings. Hopefully it is now clear that, even if nothing seems to have happened, the tibble has been grouped. Note that, since you saved the output of group_by() into the new variable shallow_g, shallow itself was not affected (try running attr(shallow, "groups") and you will get NULL). Here’s the call and its output:

attr(shallow_g, "groups")
# A tibble: 2 × 2
  Group       .rows
  <chr> <list<int>>
1 L1        [2,900]
2 L2        [3,600]

There are 2,900 rows in Group = L1 and 3,600 rows in Group = L2. Now let’s take the shallow_g data and calculate summary measures for L1 and L2 participants separately (as per the Group column).

Exercise 3

Get the rounded mean, median, SD, minimum and maximum of RTs for L1 and L2 participants in shallow_g.

You can do it! You’ve done this above, but with shallow. Now you just need to use shallow_g and also get the minimum and maximum.

Show me
summarise(
  shallow_g,
  mean = round(mean(RT)),
  median = round(median(RT)),
  sd = round(sd(RT)),
  min = round(min(RT)),
  max = round(max(RT))
)

This way of first grouping the data with group_by() and then using summarise() on the grouped tibble works, but it can become tedious if you want to get summaries for different groups and/or combinations of groups. There is a more succinct way of doing this, using the pipe |>. Read on to learn about it.

11.3.1 What the pipe!?

Think of a pipe |> as a teleporter: it teleports whatever is on its left into whatever is on its right. The pipe allows you to “stack” multiple operations into a pipeline, without the need to assign each intermediate output to a variable. This makes the code more succinct and also more readable, because the code is written in the same order as the operations in the pipeline. So we can get summary measures for each group in Group like so:

shallow |> 
  group_by(Group) |> 
  summarise(mean = round(mean(RT)))

The code says:

  • Take the shallow data.

  • Pipe it into group_by() and group it by Group.

  • Summarise the grouped data with summarise().

Hopefully this just makes sense, but check the R Note box below if you want more details.

group_by() can group according to more than one column: just list the columns separated by commas (like group_by(Col1, Col2, Col3)). When you list more than one column, the grouping is fully crossed: you get a group for each combination of values of the grouping columns. Try grouping the data by Group and Word_Nonword and getting summary measures.

Exercise 4

Group shallow by Group and Word_Nonword and get summary measures of RTs. Use the pipe.

group_by(Group, Word_Nonword)
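
Show me
One possible solution, using the rounded mean and SD (any of the summary measures you have seen so far will do):

shallow |>
  group_by(Group, Word_Nonword) |>
  summarise(RT_mean = round(mean(RT)), RT_sd = round(sd(RT)))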

11.4 Counting observations with count()

If you want to count observations, you can use the summarise() function with n(), another dplyr function, which returns the group size. For example, let’s count the number of languages by their endangerment status. The data in coretta2022/glot_status.rds contains the endangerment status of 7,845 languages from Glottolog. There are thousands of languages in the world, but most of them are losing speakers, and some are no longer spoken at all. The status column contains the endangerment status of each language, on a scale from not endangered (languages with large populations of speakers) through threatened, shifting and nearly extinct, to extinct (languages with no living speakers left). Read the coretta2022/glot_status.rds data and check it out.
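
Since the file is an .rds file, you can read it with the base R readRDS() function (the path below is an assumption: adjust it to where you saved the file):

glot_status <- readRDS("data/coretta2022/glot_status.rds")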

To count the number of languages by status, we group the data by status and we summarise with n().

glot_status |> 
  group_by(status) |> 
  summarise(n = n())

This approach works. However, dplyr offers a more compact way to get counts with the count() function! You can think of this function as a group_by/summarise combo. You list the columns you want to group by as arguments to count() and the output gives you a column n with the counts. It works with a single column or more than one, like group_by().

glot_status |> 
  count(status)

Exercise 5

Get the number of languages by status and Macroarea.
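
If you want to check your answer, here is one possible solution (remember that count() takes more than one column):

glot_status |> 
  count(status, Macroarea)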

R Note: The native pipe |>

With the release of R 4.1.0, a new feature was introduced to the base language that has significantly improved the readability and expressiveness of R code: the native pipe operator, written |>. The native pipe allows the result of one expression to be passed automatically as the first argument of another function. This simple idea has a profound impact on how we write R code, particularly when performing a sequence of data transformations.

Before the native pipe, it was common to see deeply nested function calls that could be difficult to read and reason about. For example, consider the task of computing the square root of the sum of a vector:

sqrt(sum(c(1, 2, 3, 4)))

While this is relatively simple, as functions become more complex and more transformations are chained together, nested calls quickly become cumbersome. The native pipe solves this by letting you write each operation in a left-to-right, stepwise manner, which mirrors the logical flow of the data. The key principle of the native pipe is that the left-hand side (LHS) is evaluated first and its result is automatically passed as the first argument to the function on the right-hand side (RHS). This means that a simple pipe like:

x |> f()

is equivalent to writing:

f(x)
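
For instance, the nested sqrt() and sum() call from above can be rewritten as a pipeline that reads left to right and returns the same result:

c(1, 2, 3, 4) |> sum() |> sqrt()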

This principle is important because it defines the natural behavior of the pipe: whatever computation you produce on the LHS will be injected as the first input to the next function. Consider the mtcars dataset, which is built into R. Suppose we want to compute the average miles per gallon (mpg) for each number of cylinders (cyl). Using the native pipe in combination with tidyverse functions, the code is straightforward and highly readable:

library(dplyr)

mtcars |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg))

Let’s break this down:

  1. The mtcars dataset is the left-hand side. It is evaluated first and becomes the input for the next function.
  2. group_by(cyl) receives the dataset as its first argument, groups the data by the cyl column, and returns a grouped data frame.
  3. The grouped data frame is then piped into summarise(avg_mpg = mean(mpg)), which calculates the mean mpg for each cylinder group.

Notice that each step receives the output from the previous step as its first argument. This eliminates the need for intermediate variables and nested function calls, creating a natural, readable sequence of transformations.
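
To appreciate what the pipe saves you, here is the same computation written without it, first with an intermediate variable (the name by_cyl is arbitrary) and then with nested calls:

# With an intermediate variable:
by_cyl <- group_by(mtcars, cyl)
summarise(by_cyl, avg_mpg = mean(mpg))

# With nested calls, which have to be read inside-out:
summarise(group_by(mtcars, cyl), avg_mpg = mean(mpg))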

Comparison with the magrittr pipe (%>%)

Before R introduced the native pipe, the magrittr package popularized piping with the %>% operator. Functionally, it achieves a very similar goal: passing the result of one expression to another function. For example, the earlier group_by and summarise operation can be written with magrittr as:

library(magrittr)

mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

Some differences between the native pipe and magrittr:

  1. The native pipe is built into base R, so no external package is required.
  2. The native pipe always passes the LHS value to the first argument of the RHS function.
  3. magrittr allows more flexibility via the . placeholder, which can inject the LHS value into any argument (see the example after this list).
  4. Performance-wise, the native pipe has minimal overhead compared to %>%, which is a function call.
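
Here is a minimal sketch of the placeholder difference, assuming R 4.2 or later for the native pipe’s underscore placeholder (which, unlike magrittr’s dot, must be passed to a named argument):

library(magrittr)

# magrittr: the dot can be placed in any argument position.
10 %>% seq(1, ., by = 2)
[1] 1 3 5 7 9

# Native pipe: the underscore placeholder must be a named argument.
10 |> seq(1, to = _, by = 2)
[1] 1 3 5 7 9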

Overall, the native pipe provides a simple, consistent, and readable way to chain operations, especially when working with tidyverse workflows. For users already familiar with %>%, the transition is intuitive, with the added benefit that this feature is now a part of base R.