11 Summarise data
11.1 Summarise with summarise()
Now that you have learned about summary measures, we can talk about how to summarise data in R, rather than just vectors as we did in the previous chapter. When you work with data, you will usually want summary measures for most of the variables, and data reports typically include them. It is also important to understand which summary measure is appropriate for which type of variable, which was covered in the previous section. Now you will learn how to obtain summary measures using the summarise()
function from the dplyr tidyverse package. Let’s practice with the data from Song et al. (2020) you read in Chapter 9. We want to get a measure of central tendency and dispersion for the reaction times, in the RT
column. In order to decide which measures to pick, think about the nature of the RT
variable. Reaction times is a numeric and continuous statistical variable, and it can only have positive values. So the mean and standard deviations are appropriate measures. Let’s start with the mean of the reaction time column RT
. Go to your week-02.R
script: if you followed Chapter 9 (and you should have), the script should already have the code to attach the tidyverse and read the song2020/shallow.csv
file into a variable called shallow
.
Now let’s calculate the mean of RT
with summarise()
. The summarise()
function takes at least two arguments: (1) the tibble to summarise, (2) one or more summary functions applied to columns in the tibble. In this case we just want the mean RTs. To get this, you write RT_mean = mean(RT)
which tells the function to calculate the mean of the RT
column and save the result in a new column called RT_mean
. Yes, summarise()
returns a tibble (a data frame)! It might seem overkill now, but you will see below that this is useful when you are grouping the data, so that, for example, you can get the mean of different groups in the data. Here is the code with its output:

summarise(shallow, RT_mean = mean(RT))
Great! The mean reaction time of the entire sample is 867.3592 ms. Sometimes you might want to round the numbers. You can round numbers with the round()
function. For example:
num <- 867.3592
round(num)
[1] 867
round(num, 1)
[1] 867.4
round(num, 2)
[1] 867.36
The second argument of the round()
function sets the number of decimals to round to (by default, it is 0
, so the number is rounded to the nearest integer, that is, to the nearest whole number with no decimal values). Let’s recalculate the mean by rounding it this time.
summarise(shallow, RT_mean = round(mean(RT)))
What if we also want the standard deviation? Easy: we use the sd()
function. Round the mean and SD with the round()
function when you write the code in your week-02.R
script.
# round the mean and SD
summarise(shallow, RT_mean = round(mean(RT)), RT_sd = round(sd(RT)))
Now we know that reaction times are on average 867 ms long and have a standard deviation of about 293 ms (rounded to the nearest integer). Let’s go all the way and also get the minimum and maximum RT values with the min()
and max()
functions (again, round all the summary measures).
Fab! When writing a data report, you could write something like this.
Reaction times are on average 867 ms long (SD = 293 ms), with values ranging from 0 to 1994 ms.
Remember that standard deviations are a relative measure of how dispersed the data are around the mean: the higher the SD, the greater the dispersion around the mean, i.e. the greater the variability in the data. However, you won’t be able to compare standard deviations across different measures: for example, you can’t compare the standard deviation of reaction times and of vowel formants because the first is in milliseconds and the second in Hertz; these are two different numeric scales. When required, you can use the median()
function to calculate the median, instead of the mean()
. Go ahead and calculate the median reaction times in the data. Is it similar to the mean?
11.2 NA: Not Available
Most base R functions, like mean()
, sd()
, median()
and so on, behave unexpectedly if the vector they are used on contains NA
values. NA
is a special object in R that indicates a value is Not Available: the observation has no recorded value (or the value was not observed in that case). For example, the following numeric vector contains five elements:
a <- c(3, 5, 3, NA, 4)
Four are numbers and one is NA
. If you calculate the mean of a
with mean()
something strange happens.
mean(a)
[1] NA
The function returns NA
. This is because, by default, when even a single value in the vector is NA
, operations on the vector return NA
.
mean(a)
[1] NA
sum(a)
[1] NA
sd(a)
[1] NA
If you want to discard the NA
values when operating on a vector that contains them, you have to set the na.rm
(for “NA
remove”) argument to TRUE
.
mean(a, na.rm = TRUE)
[1] 3.75
sum(a, na.rm = TRUE)
[1] 15
sd(a, na.rm = TRUE)
[1] 0.9574271
11.3 Grouping data with group_by()
More often, you will want to calculate summary measures for specific subsets of the data. An elegant way of doing this is with the group_by()
function from dplyr. This function takes a tibble, groups the data based on the specified columns, and returns another tibble with the grouping.
shallow_g <- group_by(shallow, Group)
It looks as if nothing happened, but now the rows in the shallow_g
tibble are grouped depending on the value of Group
(L1
or L2
). If you print out the tibble in the console (just write shallow_g
in the Console and press enter), you will notice that the second line of the output says Groups: Group [2]
, like in the output below. This line tells you how the tibble is grouped: here it is grouped by Group
and there are two groups.
# A tibble: 6,500 × 11
# Groups: Group [2]
Group ID List Target ACC RT logRT Critical_Filler Word_Nonword
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 L1 L1_01 A banoshment 1 423 6.05 Filler Nonword
2 L1 L1_01 A unawareness 1 603 6.40 Critical Word
3 L1 L1_01 A unholiness 1 739 6.61 Critical Word
4 L1 L1_01 A bictimize 1 510 6.23 Filler Nonword
5 L1 L1_01 A unhappiness 1 370 5.91 Critical Word
6 L1 L1_01 A entertainer 1 689 6.54 Filler Word
7 L1 L1_01 A unsharpness 1 821 6.71 Critical Word
8 L1 L1_01 A fersistent 1 677 6.52 Filler Nonword
9 L1 L1_01 A specificity 0 798 6.68 Filler Word
10 L1 L1_01 A termination 1 610 6.41 Filler Word
# ℹ 6,490 more rows
# ℹ 2 more variables: Relation_type <chr>, Branching <chr>
The grouping information is stored as an “attribute” in the tibble, named groups
. You can check this attribute with attr()
. You get a tibble with the groupings. Hopefully now you understand that, even if nothing seems to have happened, the tibble has been grouped. Since you saved the output of group_by()
into a new variable shallow_g
, note that shallow
was not affected (try running attr(shallow, "groups")
and you will get a NULL
). Here’s the output:
# A tibble: 2 × 2
Group .rows
<chr> <list<int>>
1 L1 [2,900]
2 L2 [3,600]
There are 2,900 rows in Group = L1 and 3,600 rows in Group = L2. Now let’s take the shallow_g
data and calculate summary measures for L1 and L2 participants separately (as per the Group
column).
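As a minimal sketch, assuming we want the rounded mean and SD for each group, we can call summarise() on the grouped tibble:

```r
# Summarise the grouped tibble: one row per level of Group (L1, L2).
summarise(
  shallow_g,
  RT_mean = round(mean(RT)),
  RT_sd   = round(sd(RT))
)
```

Because shallow_g is grouped by Group, the output has one row per group rather than a single row for the whole data.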
This way of grouping the data first with group_by()
and then using summarise()
on the grouped tibble works, but it can become tedious if you want to get summaries for different groups and/or combinations of groups. There is a more succinct way of doing this using the pipe |>
. Read on to learn about it.
11.3.1 What the pipe!?
Think of a pipe |>
as a teleporter. The pipe |>
teleports whatever is on its left into whatever is on its right. The pipe allows you to “stack” multiple operations into a pipeline, without the need to assign each output to a variable. This means that the code is more succinct and even more readable because the way you write code follows exactly the pipeline. So we can get summary measures for each group in Group
like so:
shallow |>
  group_by(Group) |>
  summarise(mean = round(mean(RT)))
The code says:
1. Take the shallow data.
2. Pipe it into group_by() and group it by Group.
3. Summarise the grouped data with summarise().
Hopefully this just makes sense, but check the R Note box below if you want more details.
group_by()
can group according to more than one column, by listing the columns separated by commas (like group_by(Col1, Col2, Col3)
). When you list more than one column, the grouping is fully crossed: you get a group for each combination of the grouping columns. Try to group the data by Group
and Word_Nonword
and get summary measures.
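One possible solution is sketched below (the choice of summary measure is illustrative):

```r
# Fully crossed grouping: one row per combination of
# Group (L1, L2) and Word_Nonword (Word, Nonword).
shallow |>
  group_by(Group, Word_Nonword) |>
  summarise(RT_mean = round(mean(RT)))
```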
11.4 Counting observations with count()
If you want to count observations you can use the summarise()
function with n()
, another dplyr function that returns the group size. For example, let’s count the number of languages by their endangerment status. The data in coretta2022/glot_status.rds
contains the endangerment status for 7,845 languages from Glottolog. There are thousands of languages in the world, but most of them are losing speakers, and some are already no longer spoken. The column status
contains the endangerment status of a language in the data, on a scale from not endangered
(languages with large populations of speakers) through threatened
, shifting
and nearly extinct
, to extinct
(languages that have no living speakers left). Read the coretta2022/glot_status.rds
data and check it out.
To count the number of languages by status, we group the data by status
and we summarise with n()
.
glot_status |>
  group_by(status) |>
  summarise(n = n())
This approach works. However, dplyr offers a more compact way to get counts with the count()
function! You can think of this function as a group_by/summarise
combo. You list the columns you want to group by as arguments to count()
and the output gives you a column n
with the counts. It works with a single column or more than one, like group_by()
.
glot_status |>
  count(status)
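To see count() with more than one column, we can reuse the shallow data from earlier in the chapter (a sketch, using the Group and Word_Nonword columns):

```r
# One row per combination of Group and Word_Nonword,
# with the counts in the n column.
shallow |>
  count(Group, Word_Nonword)
```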