Chapter 9 Summarising and grouping
This chapter will explore the summarise
and group_by
verbs. We consider them together because they are often used in combination. Their usage is also a bit different from the other dplyr verbs we’ve encountered. Here’s a quick summary of what they do:
- The group_by function adds information to its input (a data frame or tibble), which makes subsequent calculations happen on a group-specific basis.
- The summarise function is a data reduction function that calculates single-number summaries of one or more variables, respecting the group structure if present.
We illustrate these ideas using the Palmer penguins data set, which we assume has been read into a tibble called penguins.
9.1 Summarising variables with summarise
We use summarise
to calculate summaries of variables in an object containing our data. We do this kind of calculation all the time when analysing data. In terms of pseudo-code, usage of summarise
looks like this:
summarise(<data>, <expression-1>, <expression-2>, ...)
The first argument, <data>
, must be the name of the data frame or tibble containing our data. We then include a series of one or more additional arguments; each of these is a valid R expression involving at least one variable in <data>
. These are given by the pseudo-code placeholder <expression-1>, <expression-2>, ...
, where <expression-1>
and <expression-2>
represent the first two expressions, and the ...
is acting as a placeholder for the remaining expressions. These expressions can be any calculation involving R functions that returns a vector of some kind.
The summarise
function seems to work a lot like mutate
. The main difference is that the expressions mutate
uses must all return a vector of the same length as their inputs. In contrast, the expressions used with summarise
must all produce output of the same length as one another, but that length can be anything. They often return a single value because they are summarising the data in some way, but they don’t have to.
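The contrast between the two verbs can be sketched with a tiny example. The tibble toy below is made up purely for illustration (it is not part of the penguins data); the only point is to compare the number of rows each verb returns.

```r
library(dplyr)

# 'toy' is a hypothetical tibble used only for this illustration
toy <- tibble(x = c(2, 4, 6, 8))

# mutate: the expression returns one value per row, so the row count is unchanged
doubled <- mutate(toy, x_doubled = x * 2)
nrow(doubled)

# summarise: the expression collapses the column to one value, so we get one row
collapsed <- summarise(toy, x_mean = mean(x))
nrow(collapsed)
```

Recent dplyr releases warn when a summarise expression returns more than one value per group and point to reframe for that case, so single-value summaries are the safest default.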
The summarise
verb is best understood by example. The dplyr function n_distinct
calculates the number of distinct (i.e. unique) values in a vector. We can use n_distinct
with summarise
to calculate the number of unique values of the bill_length_mm
and bill_depth_mm
variables like this:
summarise(penguins, n_distinct(bill_length_mm), n_distinct(bill_depth_mm))
## # A tibble: 1 × 2
## `n_distinct(bill_length_mm)` `n_distinct(bill_depth_mm)`
## <int> <int>
## 1 165 81
Notice what kind of object summarise
returns. It is a tibble with one row and two columns: two columns because we calculated two counts, and one row because we only calculated one set of counts. There are a few other things to note about how summarise
works:
- The expression that performs each calculation is not surrounded by quotes because it’s an expression that ‘does a calculation’.
- The order of the columns in the output is the same as the order in which they were created in the <expression-1>, <expression-2>, ... list.
- summarise returns the same kind of data object as its input: a data frame if our data was originally in a data frame, or a tibble if it was in a tibble.
- If we don’t specify a name, summarise uses the actual R expression to name the columns of its output (e.g. n_distinct(bill_length_mm)).
Variable names based on the calculation (e.g. n_distinct(bill_length_mm)
) are not ideal because they are long and contain special characters like (
. This makes it difficult to refer to columns in the output because we have to remember to place backticks (`
) around their name whenever we want to refer to them.
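As a small sketch of why such names are awkward, the example below summarises a made-up tibble (toy is hypothetical, not the penguins data) without naming the result, and then has to use backticks to get the value back out.

```r
library(dplyr)

# 'toy' is a hypothetical tibble used only for this illustration
toy <- tibble(x = c(1, 1, 2, 3, 3))

# no name supplied, so the column is named after the expression itself
counts <- summarise(toy, n_distinct(x))
names(counts)

# the name contains parentheses, so it must be wrapped in backticks
n_unique <- pull(counts, `n_distinct(x)`)
n_unique
```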
Fortunately, the summarise
function can name the new variables at the same time as they are created (just like mutate
). We do this by naming the arguments using =
, placing the name we require on the left hand side. For example:
summarise(penguins,
n_bill_length = n_distinct(bill_length_mm),
n_bill_depth = n_distinct(bill_depth_mm))
## # A tibble: 1 × 2
## n_bill_length n_bill_depth
## <int> <int>
## 1 165 81
This time we end up with a summary data set that has reasonable column names. Notice how we organised that example: we placed each calculation on a new line. We don’t have to do this, but since R ignores extra white space inside a function call, we can use newlines and spaces to keep everything more human-readable. It pays to organise summarise
calculations like this when they become longer.
9.1.1 More complicated calculations with summarise
Many useful base R functions can be used with summarise
. Of particular value are those that calculate various summaries of numeric variables, such as:
- min and max calculate the minimum and maximum values,
- mean and median calculate the mean and median, and
- sd and var calculate the standard deviation and variance.
We do need to pay attention when using base R functions with dplyr. Take a look at this attempt to use summarise to calculate the mean of bill_length_mm
and bill_depth_mm
:
summarise(penguins,
mean_bill_length = mean(bill_length_mm),
mean_bill_depth = mean(bill_depth_mm))
## # A tibble: 1 × 2
## mean_bill_length mean_bill_depth
## <dbl> <dbl>
## 1 NA NA
No numbers—just a pair of NA
s. We forgot about the presence of missing values in the penguins
data. Both bill_length_mm
and bill_depth_mm
contain missing values. When the mean
function encounters even one missing value in its input, its default behaviour is to return NA
. It is possible to change that behaviour by setting the na.rm
argument of mean
:
summarise(penguins,
mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
mean_bill_depth = mean(bill_depth_mm, na.rm = TRUE))
## # A tibble: 1 × 2
## mean_bill_length mean_bill_depth
## <dbl> <dbl>
## 1 43.9 17.2
This example demonstrates something important—the functions we use within summarise
often have their own arguments, and we sometimes need to set those arguments to perform the calculation we want.
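When missing values are silently dropped, it is often worth recording how many there were. The sketch below uses a made-up tibble (toy is hypothetical) to show the default behaviour of mean, the effect of na.rm = TRUE, and one way of counting the missing values in the same summarise call.

```r
library(dplyr)

# 'toy' is a hypothetical tibble with one missing value
toy <- tibble(x = c(10, 20, NA, 30))

result <- summarise(toy,
  mean_default = mean(x),                # NA, because x contains a missing value
  mean_no_na   = mean(x, na.rm = TRUE),  # mean of the three non-missing values
  n_missing    = sum(is.na(x)))          # number of values that na.rm removed
result
```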
Almost any R code can be used in a summarise
expression. This means we can combine more than one function to build up arbitrarily complicated calculations. For example, if we need to know the ratio of the mean bill length to the mean bill depth in penguins
, we would use:
summarise(penguins,
ratio = mean(bill_length_mm, na.rm = TRUE) / mean(bill_depth_mm, na.rm = TRUE))
## # A tibble: 1 × 1
## ratio
## <dbl>
## 1 2.56
The ability to work with arbitrary expressions makes summarise
(and mutate
) very powerful.
9.2 Grouped operations using group_by
Performing a calculation with one or more variables using the whole data set can be useful. However, we often need to carry out calculations on different subsets of our data. For example, it’s more useful to know how the mean bill length and depth vary among the different species in the penguins
data set, rather than knowing the overall mean of these traits. We could calculate separate means by using filter
to create different subsets of penguins
, and then use summarise
on each of these to calculate the means. This would get the job done, but it’s inefficient and quickly becomes tiresome if we have to work with many groups.
The group_by
function provides an elegant solution to this kind of problem. All the group_by
function does is add a bit of information to a tibble or data frame. In effect, it defines subsets of data based on one or more grouping variables. That’s all it does.
The magic happens when the grouped object is used with a dplyr verb like summarise
or mutate
. Once the data has been tagged with grouping information, operations that involve dplyr verbs are carried out on separate subsets of the data, defined by the values of the grouping variable(s), and then combined.
Basic usage of group_by
looks like this:
group_by(<data>, <variable-1>, <variable-2>, ...)
The first argument, <data>
, must be the name of the object containing our data. We then have to include one or more additional arguments, where each one is the name of a variable in <data>
. We have expressed this as <variable-1>, <variable-2>, ...
, where <variable-1>
and <variable-2>
are names of the first two variables, and the ...
is acting as a placeholder for the remaining variables.
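To make the contrast with the filter-based approach concrete, here is a minimal sketch using a made-up tibble (toy, with hypothetical columns group and value). The manual route needs one filter and summarise call per group; the grouped route handles every group in a single call and labels each row of the result.

```r
library(dplyr)

# 'toy' is a hypothetical tibble used only for this illustration
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# manual route: one filter + summarise per group
mean_a <- summarise(filter(toy, group == "a"), mean_value = mean(value))
mean_b <- summarise(filter(toy, group == "b"), mean_value = mean(value))

# grouped route: one call covers every group
by_group <- summarise(group_by(toy, group), mean_value = mean(value))
by_group
```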
We’ll illustrate group_by
by using it alongside summarise
. We’re aiming to calculate the mean bill length for each species in penguins
. This is a two-step process. The first step uses group_by
to add grouping information to penguins
. Take a look at what we end up with when we do that:
group_by(penguins, species)
## # A tibble: 344 × 8
## # Groups: species [3]
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
Compare this to the output produced when we print the original penguins
data set:
penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
There is very little difference—group_by
really doesn’t do much on its own. The main change is that printing the tibble resulting from the group_by
operation shows a bit of additional information at the top: Groups: species [3]
. This tells us that the tibble is now grouped by the species
variable. The [3]
part tells us that there are three different groups (i.e. species of penguin). The only thing group_by
did was add this grouping information to a copy of penguins
.
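Rather than relying on the printed header, we can also query the grouping information directly. The sketch below uses a made-up tibble (toy is hypothetical) together with the dplyr helpers group_vars, n_groups and is_grouped_df.

```r
library(dplyr)

# 'toy' is a hypothetical tibble used only for this illustration
toy <- tibble(group = c("a", "a", "b"), value = c(1, 2, 3))
grouped <- group_by(toy, group)

group_vars(toy)         # no grouping variables on the original
group_vars(grouped)     # the grouping variable(s) of the grouped copy
n_groups(grouped)       # number of distinct groups
is_grouped_df(grouped)  # TRUE for a grouped tibble
```

The underlying rows and columns are the same in both objects; only the grouping metadata differs.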
The original penguins
object was not altered in any way. If we want to do anything useful with the grouped tibble we need to assign it a name so that we can work with it:
penguins_by_species <- group_by(penguins, species)
Now we have a grouped tibble called penguins_by_species
in which the values of species
define the different groups—any row where species
is equal to ‘Adelie’ is assigned to the first group, any row where species
is equal to ‘Chinstrap’ is assigned to a second group, and any row where species
is equal to ‘Gentoo’ is assigned to a third group.
dplyr operations on this tibble will now be performed on a ‘by group’ basis. To see this in action, we use summarise
to calculate the mean bill length:
summarise(penguins_by_species,
mean_bill_length = mean(bill_length_mm, na.rm = TRUE))
## # A tibble: 3 × 2
## species mean_bill_length
## <chr> <dbl>
## 1 Adelie 38.8
## 2 Chinstrap 48.8
## 3 Gentoo 47.5
This is part two of the two-step process mentioned above. When we used summarise
on an ungrouped object, the result was a tibble with one row—the overall global mean. Now the resulting tibble has three rows, one for each species in the data set. The mean_bill_length
column shows the mean bill lengths for each species. The species
column tells us what species each mean belongs to. Depending on the version of dplyr, summarise
may also print an (un)helpful message like this one:
`summarise()` ungrouping output (override with `.groups` argument)
There’s no need to worry about this. It is simply saying that summarise
has removed the grouping information from the resulting tibble.
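If we prefer to be explicit about what happens to the grouping, we can set the .groups argument ourselves. The following is a small sketch on a made-up tibble (toy is hypothetical) showing two of the documented options, "drop" and "keep".

```r
library(dplyr)

# 'toy' is a hypothetical tibble used only for this illustration
toy <- tibble(group = c("a", "a", "b", "b"), value = c(1, 3, 10, 20))
grouped <- group_by(toy, group)

# "drop" returns an ungrouped tibble
dropped <- summarise(grouped, mean_value = mean(value), .groups = "drop")
is_grouped_df(dropped)

# "keep" retains the original grouping in the result
kept <- summarise(grouped, mean_value = mean(value), .groups = "keep")
group_vars(kept)
```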
We can also carry out multiple calculations with grouped data if we need to. For example, if we need to calculate the mean bill length and mean bill depth for each species, we would use the grouped version of penguins
like this:
summarise(penguins_by_species,
mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
mean_bill_depth = mean(bill_depth_mm, na.rm = TRUE))
## # A tibble: 3 × 3
## species mean_bill_length mean_bill_depth
## <chr> <dbl> <dbl>
## 1 Adelie 38.8 18.3
## 2 Chinstrap 48.8 18.4
## 3 Gentoo 47.5 15.0
9.2.1 More than one grouping variable
What if we need to calculate summaries using more than one grouping variable? The workflow is unchanged. Assume we need to know the mean body mass of males and females of each penguin species. First, we make a grouped copy of penguins
using the appropriate grouping variables:
penguins_by_species_sex <- group_by(penguins, species, sex)
We called the grouped tibble penguins_by_species_sex
. Look at what happens when we print this:
penguins_by_species_sex
## # A tibble: 344 × 8
## # Groups: species, sex [8]
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
We see Groups: species, sex [8]
near the top, which tells us that the tibble is grouped by two variables (species
and sex
) with eight unique combinations of values. That seems odd at first—there are three species and two sexes represented in this dataset, which gives six possible combinations at most.
The reason for the discrepancy becomes clear when we move on to calculate the mean body mass for each sex-species combination:
summarise(penguins_by_species_sex,
body_mass_g = mean(body_mass_g, na.rm = TRUE))
## `summarise()` has grouped output by 'species'. You can override using the `.groups`
## argument.
## # A tibble: 8 × 3
## # Groups: species [3]
## species sex body_mass_g
## <chr> <chr> <dbl>
## 1 Adelie female 3369.
## 2 Adelie male 4043.
## 3 Adelie <NA> 3540
## 4 Chinstrap female 3527.
## 5 Chinstrap male 3939.
## 6 Gentoo female 4680.
## 7 Gentoo male 5485.
## 8 Gentoo <NA> 4588.
This shows mean body mass for each unique combination of species
and sex
. The first line shows that the mean body mass associated with female Adelie penguins is 3369; the second line shows us the mean body mass associated with male Adelie penguins is 4043. The third line shows us the mean body mass of Adelie penguins where sex is missing (NA
). That explains why we ended up with more groups than unique combinations of species
and sex
— missing values create extra groups.
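If the extra groups are not wanted, one option is to remove rows with a missing grouping value before grouping. The sketch below illustrates this on a made-up tibble (toy and its columns are hypothetical); whether dropping those rows is appropriate depends on the analysis.

```r
library(dplyr)

# 'toy' is a hypothetical tibble in which one row has a missing sex
toy <- tibble(
  species = c("A", "A", "A", "B", "B"),
  sex     = c("female", "male", NA, "female", "male"),
  mass    = c(10, 12, 11, 20, 24)
)

# grouping as-is: the missing value forms its own (A, NA) group
with_na <- group_by(toy, species, sex)
n_groups(with_na)

# filtering out missing sex first leaves only the complete combinations
no_na <- group_by(filter(toy, !is.na(sex)), species, sex)
n_groups(no_na)
```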
9.2.2 Using group_by with other verbs
The summarise
function is the dplyr verb that is most often used with grouped data. However, all the main dplyr verbs will alter their behaviour to respect group information when it is present. For example, when mutate
or transmute
are used with a grouped object, the calculation of new variables occurs “by group”. Here’s an example:
# create a 'mean centred' body mass variable
transmute(penguins_by_species_sex,
body_mass_cen = body_mass_g - mean(body_mass_g, na.rm = TRUE))
## # A tibble: 344 × 3
## # Groups: species, sex [8]
## species sex body_mass_cen
## <chr> <chr> <dbl>
## 1 Adelie male -293.
## 2 Adelie female 431.
## 3 Adelie female -119.
## 4 Adelie <NA> NA
## 5 Adelie female 81.2
## 6 Adelie male -393.
## 7 Adelie female 256.
## 8 Adelie male 632.
## 9 Adelie <NA> -65
## 10 Adelie <NA> 710
## # ℹ 334 more rows
This calculated a mean-centred measure of body mass. The new body_mass_cen
variable contains the difference between the original body mass and its mean in the appropriate species-sex group (rather than the overall mean).
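The effect of grouping on mutate is easiest to see by running the same expression with and without it. This is a minimal sketch on a made-up tibble (toy is hypothetical): the ungrouped call subtracts the overall mean, while the grouped call subtracts each group's own mean.

```r
library(dplyr)

# 'toy' is a hypothetical tibble used only for this illustration
toy <- tibble(group = c("a", "a", "b", "b"), value = c(1, 3, 10, 20))

# ungrouped: mean(value) is the overall mean (8.5)
overall <- mutate(toy, centred = value - mean(value))

# grouped: mean(value) is computed separately within "a" (2) and "b" (15)
by_group <- mutate(group_by(toy, group), centred = value - mean(value))

overall$centred
by_group$centred
```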
9.3 Removing grouping information
On occasion, it’s necessary to remove grouping information and revert to operating on the whole data set. The ungroup
function removes grouping information:
ungroup(penguins_by_species)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
Looking at the top of the printed summary, we can see that the Groups:
part is now gone—the ungroup
function effectively recreated the original penguins
tibble.
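A quick way to confirm that ungroup has taken effect is to summarise before and after. In this sketch (toy is a made-up tibble, not the penguins data) the grouped version returns one row per group, while the ungrouped version returns a single row for the whole data set.

```r
library(dplyr)

# 'toy' is a hypothetical tibble used only for this illustration
toy <- tibble(group = c("a", "a", "b", "b"), value = c(1, 3, 10, 20))
grouped <- group_by(toy, group)

# grouped: n() counts rows within each group
per_group <- summarise(grouped, n = n())

# ungrouped: n() counts every row
whole <- summarise(ungroup(grouped), n = n())

nrow(per_group)
nrow(whole)
```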