Chapter 11 Helper functions

11.1 Introduction

In addition to the main dplyr verbs, the package provides quite a few helper functions. Helper functions are used in conjunction with the main verbs to make specific tasks and calculations a bit easier. Many of these are summarised in the dplyr cheat sheet — find the ‘Data transformation with dplyr’ cheat sheet and look under its Manipulate Variables, Vector Functions and Summary Functions sections. We are not going to review every single one of them in this chapter. Instead, we aim to point out where helper functions tend to be used and highlight a few of the more useful ones.

11.2 Working with select

There are a few helper functions that can be used with select. Their job is to make it easier to match variable names according to various criteria. We’ll look at the three simplest of these—look at the examples in the help file for select and the ‘Data transformation with dplyr’ cheat sheet to see what else is available.

We can select variables according to the sequence of characters used at the start of their name with the starts_with function. For example, to select all the variables in penguins that begin with the word “bill”, we use:

select(penguins, starts_with("Bill"))
## # A tibble: 344 × 2
##    bill_length_mm bill_depth_mm
##             <dbl>         <dbl>
##  1           39.1          18.7
##  2           39.5          17.4
##  3           40.3          18  
##  4           NA            NA  
##  5           36.7          19.3
##  6           39.3          20.6
##  7           38.9          17.8
##  8           39.2          19.6
##  9           34.1          18.1
## 10           42            20.2
## # ℹ 334 more rows

This returns a tibble containing just bill_length_mm and bill_depth_mm. There is also a helper function to select variables according to characters used at the end of their name—the ends_with function (no surprises there).

Notice that we quote the name we want to match against because starts_with expects a literal character value. This is not optional. Unusually, starts_with and ends_with are not case sensitive by default. For example, we passed starts_with the argument "Bill" instead of "bill", yet it still selected variables beginning with the character string "bill". If we want to select variables on a case-sensitive basis, we need to set an argument ignore.case to FALSE in starts_with and ends_with.

The last select helper function we’ll look at is called contains. This one allows us to select variables based on a partial match anywhere in their name. Look at what happens if we pass contains the argument "length":

select(penguins, contains("length"))
## # A tibble: 344 × 2
##    bill_length_mm flipper_length_mm
##             <dbl>             <int>
##  1           39.1               181
##  2           39.5               186
##  3           40.3               195
##  4           NA                  NA
##  5           36.7               193
##  6           39.3               190
##  7           38.9               181
##  8           39.2               195
##  9           34.1               193
## 10           42                 190
## # ℹ 334 more rows

This selects all the variables with the word ‘length’ in their name.

There is nothing to stop us combining the different variable selection methods. For example, we can use this approach to select all the variables that have some units at the end of their names (millimetres or grams):

select(penguins, ends_with("_mm"), ends_with("_g"))
## # A tibble: 344 × 4
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##             <dbl>         <dbl>             <int>       <int>
##  1           39.1          18.7               181        3750
##  2           39.5          17.4               186        3800
##  3           40.3          18                 195        3250
##  4           NA            NA                  NA          NA
##  5           36.7          19.3               193        3450
##  6           39.3          20.6               190        3650
##  7           38.9          17.8               181        3625
##  8           39.2          19.6               195        4675
##  9           34.1          18.1               193        3475
## 10           42            20.2               190        4250
## # ℹ 334 more rows

When we apply more than one selection criteria like this, the select function returns the variables that match any criteria, rather than the set that meets all of them.

11.3 Working with mutate and transmute

There are quite a few helper functions that can be used with mutate. These make it easier to carry out certain calculations that aren’t easy to do with base R. We won’t explore these here as they tend to be needed only in quite specific circumstances. However, in situations where we need to construct an unusual variable, it’s worth looking at that ‘Data transformation with dplyr’ cheat sheet to see what options might be available.

We will look at one particularly useful helper function that’s used a lot when we need to recode a particular variable using mutate. The function is called case_when. It works by setting up a series of paired matching criteria and replacement values. For example, imagine that we want to replace the names in species with three-letter shortcodes for each species. This is how to achieve that using case_when with mutate:

penguins %>% 
  mutate(species = case_when(
           species == "Adelie"    ~ "ADL",
           species == "Gentoo"    ~ "GEN",
           species == "Chinstrap" ~ "CHN",
           TRUE ~ "UNKNOWN"
         )) 
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <int>       <int>
##  1 ADL     Torgersen           39.1          18.7               181        3750
##  2 ADL     Torgersen           39.5          17.4               186        3800
##  3 ADL     Torgersen           40.3          18                 195        3250
##  4 ADL     Torgersen           NA            NA                  NA          NA
##  5 ADL     Torgersen           36.7          19.3               193        3450
##  6 ADL     Torgersen           39.3          20.6               190        3650
##  7 ADL     Torgersen           38.9          17.8               181        3625
##  8 ADL     Torgersen           39.2          19.6               195        4675
##  9 ADL     Torgersen           34.1          18.1               193        3475
## 10 ADL     Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>

The mutate bit of this is not new. Look at the case_when component—there are four criteria. The first of these is species == "Adelie" ~ "ADL" . The way to read this is, “look for cases where the value of species is equal to "Adelie", and where that is true spit out the value "ADL"”. case_when steps through each criterion like this in turn, trying to find a match. The last one TRUE ~ "UNKNOWN" acts as a catch-all for the non-matches.

This looks confusing at first but it does make sense with a bit of practise, and recoding variables using case_when is a lot easier than going through a spreadsheet by hand.

11.4 Working with filter

There aren’t that many dplyr helper function that works with filter. In fact, we’ve already looked at the most useful one: the between function. This is used to identify cases where the values of a numeric variable lie inside a defined range. For example, if we want all the individuals that had a body mass in the 4-5kg range, we could use:

filter(penguins, between(body_mass_g, 4000, 5000))
## # A tibble: 116 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.2          19.6               195        4675
##  2 Adelie  Torgersen           42            20.2               190        4250
##  3 Adelie  Torgersen           34.6          21.1               198        4400
##  4 Adelie  Torgersen           42.5          20.7               197        4500
##  5 Adelie  Torgersen           46            21.5               194        4200
##  6 Adelie  Dream               39.2          21.1               196        4150
##  7 Adelie  Dream               39.8          19.1               184        4650
##  8 Adelie  Dream               44.1          19.7               196        4400
##  9 Adelie  Dream               39.6          18.8               190        4600
## 10 Adelie  Dream               42.3          21.2               191        4150
## # ℹ 106 more rows
## # ℹ 2 more variables: sex <chr>, year <int>

11.5 Working with summarise

There are a small number dplyr helper functions that can be used with summarise. These generally provide summaries that aren’t available directly using base R functions. For example, we’ve already seen the n_distinct function in action. This can be used to calculate the number of distinct values of a variable:

summarise(penguins, 
          num_species = n_distinct(species),
          num_island  = n_distinct(island ))
## # A tibble: 1 × 2
##   num_species num_island
##         <int>      <int>
## 1           3          3

This confirms what we already knew—that there are three unique species and three unique islands in the penguins data set.