Chapter 11 Helper functions
11.1 Introduction
In addition to the main dplyr verbs, the package provides quite a few helper functions. Helper functions are used in conjunction with the main verbs to make specific tasks and calculations a bit easier. Many of these are summarised in the dplyr cheat sheet — find the ‘Data transformation with dplyr’ cheat sheet and look under its Manipulate Variables, Vector Functions and Summary Functions sections. We are not going to review every single one of them in this chapter. Instead, we aim to point out where helper functions tend to be used and highlight a few of the more useful ones.
11.2 Working with select
There are a few helper functions that can be used with select. Their job is to make it easier to match variable names according to various criteria. We’ll look at the three simplest of these; see the examples in the help file for select and the ‘Data transformation with dplyr’ cheat sheet for what else is available.
We can select variables according to the sequence of characters used at the start of their name with the starts_with function. For example, to select all the variables in penguins that begin with the word “bill”, we use:
select(penguins, starts_with("Bill"))
## # A tibble: 344 × 2
## bill_length_mm bill_depth_mm
## <dbl> <dbl>
## 1 39.1 18.7
## 2 39.5 17.4
## 3 40.3 18
## 4 NA NA
## 5 36.7 19.3
## 6 39.3 20.6
## 7 38.9 17.8
## 8 39.2 19.6
## 9 34.1 18.1
## 10 42 20.2
## # ℹ 334 more rows
This returns a tibble containing just bill_length_mm and bill_depth_mm. There is also a helper function to select variables according to characters used at the end of their name: the ends_with function (no surprises there).
Notice that we quote the name we want to match against because starts_with expects a literal character value. This is not optional. Unusually, starts_with and ends_with are not case sensitive by default. For example, we passed starts_with the argument "Bill" instead of "bill", yet it still selected variables beginning with the character string "bill". If we want to select variables on a case-sensitive basis, we need to set the ignore.case argument to FALSE in starts_with and ends_with.
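As a quick check of the case-sensitivity behaviour, here is a minimal sketch. The tibble toy is a hypothetical two-row stand-in for penguins, built by hand so that the example runs on its own; only the column names matter here.

```r
library(dplyr)

# Hypothetical stand-in for penguins (not the real data)
toy <- tibble(
  species        = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  bill_depth_mm  = c(18.7, 13.2)
)

# Default behaviour: case is ignored, so "Bill" matches the bill_* columns
loose <- names(select(toy, starts_with("Bill")))

# Case-sensitive matching: "Bill" matches nothing, "bill" still matches
strict_upper <- names(select(toy, starts_with("Bill", ignore.case = FALSE)))
strict_lower <- names(select(toy, starts_with("bill", ignore.case = FALSE)))

loose
strict_upper
strict_lower
```

With ignore.case = FALSE, the "Bill" pattern selects no columns at all, so select returns a tibble with zero variables rather than an error.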
The last select helper function we’ll look at is called contains. This one allows us to select variables based on a partial match anywhere in their name. Look at what happens if we pass contains the argument "length":
select(penguins, contains("length"))
## # A tibble: 344 × 2
## bill_length_mm flipper_length_mm
## <dbl> <int>
## 1 39.1 181
## 2 39.5 186
## 3 40.3 195
## 4 NA NA
## 5 36.7 193
## 6 39.3 190
## 7 38.9 181
## 8 39.2 195
## 9 34.1 193
## 10 42 190
## # ℹ 334 more rows
This selects all the variables with the word ‘length’ in their name.
There is nothing to stop us combining the different variable selection methods. For example, we can use this approach to select all the variables that have some units at the end of their names (millimetres or grams):
select(penguins, ends_with("_mm"), ends_with("_g"))
## # A tibble: 344 × 4
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <dbl> <dbl> <int> <int>
## 1 39.1 18.7 181 3750
## 2 39.5 17.4 186 3800
## 3 40.3 18 195 3250
## 4 NA NA NA NA
## 5 36.7 19.3 193 3450
## 6 39.3 20.6 190 3650
## 7 38.9 17.8 181 3625
## 8 39.2 19.6 195 4675
## 9 34.1 18.1 193 3475
## 10 42 20.2 190 4250
## # ℹ 334 more rows
When we apply more than one selection criterion like this, the select function returns the variables that match any of the criteria, rather than only those that meet all of them.
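If we do want only the variables that satisfy every criterion, one option is to join the helpers with the & operator, which we believe is supported in reasonably recent versions of dplyr. The sketch below contrasts the two behaviours using toy, a hypothetical one-row tibble that borrows the penguins column names.

```r
library(dplyr)

# Hypothetical one-row stand-in for penguins (values are illustrative)
toy <- tibble(
  bill_length_mm    = 39.1,
  bill_depth_mm     = 18.7,
  flipper_length_mm = 181L,
  body_mass_g       = 3750L
)

# Comma-separated helpers: a variable is kept if it matches either one
either <- select(toy, starts_with("bill"), ends_with("_g"))
names(either)

# Helpers joined with `&`: a variable is kept only if it matches both
both <- select(toy, contains("length") & starts_with("bill"))
names(both)
```

Here the first call keeps the two bill variables plus body_mass_g, while the second keeps only bill_length_mm, the single variable that both contains "length" and starts with "bill".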
11.3 Working with mutate and transmute
There are quite a few helper functions that can be used with mutate. These make it easier to carry out certain calculations that aren’t easy to do with base R. We won’t explore these here as they tend to be needed only in quite specific circumstances. However, in situations where we need to construct an unusual variable, it’s worth looking at the ‘Data transformation with dplyr’ cheat sheet to see what options might be available.
We will look at one particularly useful helper function that’s used a lot when we need to recode a particular variable using mutate. The function is called case_when. It works by setting up a series of paired matching criteria and replacement values. For example, imagine that we want to replace the names in species with three-letter shortcodes for each species. This is how to achieve that using case_when with mutate:
penguins %>%
  mutate(species = case_when(
    species == "Adelie" ~ "ADL",
    species == "Gentoo" ~ "GEN",
    species == "Chinstrap" ~ "CHN",
    TRUE ~ "UNKNOWN"
  ))
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 ADL Torgersen 39.1 18.7 181 3750
## 2 ADL Torgersen 39.5 17.4 186 3800
## 3 ADL Torgersen 40.3 18 195 3250
## 4 ADL Torgersen NA NA NA NA
## 5 ADL Torgersen 36.7 19.3 193 3450
## 6 ADL Torgersen 39.3 20.6 190 3650
## 7 ADL Torgersen 38.9 17.8 181 3625
## 8 ADL Torgersen 39.2 19.6 195 4675
## 9 ADL Torgersen 34.1 18.1 193 3475
## 10 ADL Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
The mutate bit of this is not new. Look at the case_when component: there are four criteria. The first of these is species == "Adelie" ~ "ADL". The way to read this is, “look for cases where the value of species is equal to "Adelie", and where that is true spit out the value "ADL"”. case_when steps through each criterion like this in turn, trying to find a match. The last one, TRUE ~ "UNKNOWN", acts as a catch-all for the non-matches.
This looks confusing at first but it does make sense with a bit of practice, and recoding variables using case_when is a lot easier than going through a spreadsheet by hand.
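Because case_when works through the criteria in order, each value receives the replacement from the first criterion it satisfies. The sketch below illustrates this with a numeric variable. The tibble toy and the size thresholds are hypothetical choices made purely for illustration.

```r
library(dplyr)

# Hypothetical body masses, including one missing value
toy <- tibble(body_mass_g = c(3200L, 4100L, 5600L, NA))

sized <- toy %>%
  mutate(size = case_when(
    body_mass_g < 3500  ~ "small",
    body_mass_g < 5000  ~ "medium",  # only reached when the first test is not TRUE
    body_mass_g >= 5000 ~ "large",
    TRUE                ~ "unknown"  # catch-all, which picks up the NA
  ))

sized
```

The value 3200 satisfies both of the first two criteria, but it is labelled "small" because that criterion comes first. The missing value satisfies none of the comparisons, so it falls through to the catch-all.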
11.4 Working with filter
There aren’t that many dplyr helper functions that work with filter. In fact, we’ve already looked at the most useful one: the between function. This is used to identify cases where the values of a numeric variable lie inside a defined range, with both endpoints included. For example, if we want all the individuals that had a body mass in the 4–5 kg range, we could use:
filter(penguins, between(body_mass_g, 4000, 5000))
## # A tibble: 116 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.2 19.6 195 4675
## 2 Adelie Torgersen 42 20.2 190 4250
## 3 Adelie Torgersen 34.6 21.1 198 4400
## 4 Adelie Torgersen 42.5 20.7 197 4500
## 5 Adelie Torgersen 46 21.5 194 4200
## 6 Adelie Dream 39.2 21.1 196 4150
## 7 Adelie Dream 39.8 19.1 184 4650
## 8 Adelie Dream 44.1 19.7 196 4400
## 9 Adelie Dream 39.6 18.8 190 4600
## 10 Adelie Dream 42.3 21.2 191 4150
## # ℹ 106 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
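The between function is essentially shorthand for a pair of comparisons joined with &. The following sketch compares the two forms on toy, a hypothetical tibble whose values were chosen to sit on and around the range boundaries.

```r
library(dplyr)

# Hypothetical body masses on and around the boundaries, plus a missing value
toy <- tibble(body_mass_g = c(3950L, 4000L, 4500L, 5000L, 5050L, NA))

# Using the helper function
with_between <- filter(toy, between(body_mass_g, 4000, 5000))

# The same condition written as two explicit comparisons
with_compare <- filter(toy, body_mass_g >= 4000 & body_mass_g <= 5000)

with_between$body_mass_g
with_compare$body_mass_g
```

Both versions keep the boundary values 4000 and 5000, and both drop the missing value, because filter only retains rows where the condition is TRUE.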
11.5 Working with summarise
There are a small number of dplyr helper functions that can be used with summarise. These generally provide summaries that aren’t available directly using base R functions. For example, we’ve already seen the n_distinct function in action. This can be used to calculate the number of distinct values of a variable:
summarise(penguins,
          num_species = n_distinct(species),
          num_island = n_distinct(island))
## # A tibble: 1 × 2
## num_species num_island
## <int> <int>
## 1 3 3
This confirms what we already knew: that there are three unique species and three unique islands in the penguins data set.
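One detail worth knowing is how n_distinct treats missing values. As far as we are aware, NA is counted as a distinct value unless we set the na.rm argument to TRUE. The sketch below uses toy, a hypothetical four-row tibble, to show the difference.

```r
library(dplyr)

# Hypothetical data with one missing value in sex
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  sex     = c("male", NA, "female", "female")
)

counts <- summarise(toy,
  num_species   = n_distinct(species),
  num_sex       = n_distinct(sex),               # NA counted as a value
  num_sex_known = n_distinct(sex, na.rm = TRUE)  # NA ignored
)

counts
```

This matters for a variable like sex in penguins, where missing values would otherwise inflate the count by one.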