Chapter 7 Working with variables

7.1 Introduction

This chapter will explore the dplyr select and mutate verbs, and the closely related rename and transmute verbs. We consider these functions together because they operate on the variables (i.e. the columns) of a data frame or tibble:

  • The select function selects a subset of variables to retain and (optionally) renames them in the process.
  • The mutate function creates new variables from pre-existing ones and retains the original variables.
  • The rename function renames one or more variables while keeping the remaining variable names unchanged.
  • The transmute function creates new variables from pre-existing ones and drops the original variables.

7.1.1 Getting ready

Obviously, we need to have first installed dplyr package to use it. Assuming that’s been done, we need to load and attach the package in the current session:

library("dplyr")

We will use the Palmer penguins data to illustrate the ideas in this chapter. Remember—the previous chapter described this data set and explained where to find it. The examples below assume it was read into R as a tibble with the name penguins.

7.2 Subset variables with select

We use select to select variables from a data frame or tibble. This is used when we have a data set with many variables but only need to work with a subset of these. Basic usage of select looks like this:

select(<data>, <variable-1>, <variable-2>, ...)

Remember—this is not an example we can run. This is a pseudocode designed to provide a generic description of how we use select. Let’s look at the arguments of select:

  • The first argument, <data>, must be the name of the object containing our data (usually a data frame or tibble). This is not optional—dplyr functions only exist to manipulate data.
  • We then include a series of one or more additional arguments, where each one is the name of a variable in <data>. We’ve expressed this as <variable-1>, <variable-2>, ..., where <variable-1> and <variable-2> are names of the first two variables. The ... is acting as a placeholder for the remaining variables. There could be any number of these.

It’s easiest to understand how a function like select works by seeing it in action. We select the species, bill_length_mm and bill_depth_mm variables from penguins like this:

select(penguins, species, bill_length_mm, bill_depth_mm)
## # A tibble: 344 × 3
##    species bill_length_mm bill_depth_mm
##    <chr>            <dbl>         <dbl>
##  1 Adelie            39.1          18.7
##  2 Adelie            39.5          17.4
##  3 Adelie            40.3          18  
##  4 Adelie            NA            NA  
##  5 Adelie            36.7          19.3
##  6 Adelie            39.3          20.6
##  7 Adelie            38.9          17.8
##  8 Adelie            39.2          19.6
##  9 Adelie            34.1          18.1
## 10 Adelie            42            20.2
## # ℹ 334 more rows

Hopefully, nothing about this example is too surprising. However, there are a few subtleties buried in that example:

  • The select function is designed to work in a non-standard way which means variable names should not be surrounded by quotes. The one exception is when a name has a space in it. Under those circumstances, it has to be quoted with backticks, e.g. `variable 1`.
  • The select function does not have ‘side effects’. This means is that it does not change the original penguins object. We printed the result produced by select to the Console, so we can’t access the modified data set. If we need to use the result, we have to assign it a name using <-.
  • The order of variables (i.e. the column order) in the resulting object is the same as the order in which they were supplied to the argument list. This means we can reorder variables at the same time as selecting them if we need to.
  • The select function will return the same kind of data object we give it to work on. It returns a data frame if our data was in a data frame and a tibble if it was a tibble. In this example, R prints a tibble because penguins was a tibble.

That second point is important—we have to remember to assign the result a name using <- if we want to keep it and use it later. For example, we might call the result of that last example penguins_bill:

penguins_bill <- select(penguins, species, bill_length_mm, bill_depth_mm)

Now that we’ve named the new data set created by select we can refer to it by that name whenever we need it:

penguins_bill
## # A tibble: 344 × 3
##    species bill_length_mm bill_depth_mm
##    <chr>            <dbl>         <dbl>
##  1 Adelie            39.1          18.7
##  2 Adelie            39.5          17.4
##  3 Adelie            40.3          18  
##  4 Adelie            NA            NA  
##  5 Adelie            36.7          19.3
##  6 Adelie            39.3          20.6
##  7 Adelie            38.9          17.8
##  8 Adelie            39.2          19.6
##  9 Adelie            34.1          18.1
## 10 Adelie            42            20.2
## # ℹ 334 more rows

Remember—the original penguins data is completely unchanged:

penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>

7.2.1 Alternative ways to identify variables with select

It’s sometimes more convenient to use select to subset variables by specifying those we do not need, rather than specifying of the ones to keep. We can use the ! operator to indicate that certain variables should be dropped. For example, to get rid of the bill_depth_mm and bill_length_mm columns, we could use:

select(penguins, !bill_depth_mm, !bill_length_mm)
## # A tibble: 344 × 8
##    species island    bill_length_mm flipper_length_mm body_mass_g sex     year
##    <chr>   <chr>              <dbl>             <int>       <int> <chr>  <int>
##  1 Adelie  Torgersen           39.1               181        3750 male    2007
##  2 Adelie  Torgersen           39.5               186        3800 female  2007
##  3 Adelie  Torgersen           40.3               195        3250 female  2007
##  4 Adelie  Torgersen           NA                  NA          NA <NA>    2007
##  5 Adelie  Torgersen           36.7               193        3450 female  2007
##  6 Adelie  Torgersen           39.3               190        3650 male    2007
##  7 Adelie  Torgersen           38.9               181        3625 female  2007
##  8 Adelie  Torgersen           39.2               195        4675 male    2007
##  9 Adelie  Torgersen           34.1               193        3475 <NA>    2007
## 10 Adelie  Torgersen           42                 190        4250 <NA>    2007
## # ℹ 334 more rows
## # ℹ 1 more variable: bill_depth_mm <dbl>

This returns a tibble with all the other variables: species, island, flipper_length_mm, body_mass_g, sex and year.

The select function can also be used to grab (or drop) a set of variables that occur in a sequence next to one another. We specify a series of adjacent variables using the : operator. We use this with two variable names, one on the left-hand side and one on the right. When we use : like this, select will subset both those two variables along with any others that fall in between them.

For example, if we want only the morphometric variables (bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g) we could use:

select(penguins, bill_length_mm:body_mass_g)
## # A tibble: 344 × 4
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##             <dbl>         <dbl>             <int>       <int>
##  1           39.1          18.7               181        3750
##  2           39.5          17.4               186        3800
##  3           40.3          18                 195        3250
##  4           NA            NA                  NA          NA
##  5           36.7          19.3               193        3450
##  6           39.3          20.6               190        3650
##  7           38.9          17.8               181        3625
##  8           39.2          19.6               195        4675
##  9           34.1          18.1               193        3475
## 10           42            20.2               190        4250
## # ℹ 334 more rows

The : operator can also be combined with ! if we need to drop a series of variables according to their position in a data frame or tibble. For example, we can use this trick to get the complement of the previous example, i.e. throw away the morphometric variables:

select(penguins, !bill_length_mm:body_mass_g)
## # A tibble: 344 × 4
##    species island    sex     year
##    <chr>   <chr>     <chr>  <int>
##  1 Adelie  Torgersen male    2007
##  2 Adelie  Torgersen female  2007
##  3 Adelie  Torgersen female  2007
##  4 Adelie  Torgersen <NA>    2007
##  5 Adelie  Torgersen female  2007
##  6 Adelie  Torgersen male    2007
##  7 Adelie  Torgersen female  2007
##  8 Adelie  Torgersen male    2007
##  9 Adelie  Torgersen <NA>    2007
## 10 Adelie  Torgersen <NA>    2007
## # ℹ 334 more rows

7.2.2 Renaming variables with select and rename

The select function can also rename variables at the same time as selecting them. To do this, we name the arguments using the name = value construct, where the name of the selected variable is placed on the right-hand side (value), and the new name goes on the left-hand side (name).

For example, to select thespecies, bill_length_mm and bill_depth_mm variables from penguins, and in the process, rename bill_length_mm and bill_depth_mm to BillLength and BillDepth, use:

select(penguins, species, BillLength = bill_length_mm, BillDepth = bill_depth_mm)
## # A tibble: 344 × 3
##    species BillLength BillDepth
##    <chr>        <dbl>     <dbl>
##  1 Adelie        39.1      18.7
##  2 Adelie        39.5      17.4
##  3 Adelie        40.3      18  
##  4 Adelie        NA        NA  
##  5 Adelie        36.7      19.3
##  6 Adelie        39.3      20.6
##  7 Adelie        38.9      17.8
##  8 Adelie        39.2      19.6
##  9 Adelie        34.1      18.1
## 10 Adelie        42        20.2
## # ℹ 334 more rows

Renaming the variables is a common task. What should we do if the only thing we want to achieve is to rename variables, rather than rename and select them? dplyr provides an additional function called rename for exactly this purpose. This function renames some variables while retaining all others. It works like select. For example, to rename bill_length_mm and bill_depth_mm to BillLength and BillDepth but keep all the variables, use:

rename(penguins, BillLength = bill_length_mm, BillDepth = bill_depth_mm)
## # A tibble: 344 × 8
##    species island BillLength BillDepth flipper_length_mm body_mass_g sex    year
##    <chr>   <chr>       <dbl>     <dbl>             <int>       <int> <chr> <int>
##  1 Adelie  Torge…       39.1      18.7               181        3750 male   2007
##  2 Adelie  Torge…       39.5      17.4               186        3800 fema…  2007
##  3 Adelie  Torge…       40.3      18                 195        3250 fema…  2007
##  4 Adelie  Torge…       NA        NA                  NA          NA <NA>   2007
##  5 Adelie  Torge…       36.7      19.3               193        3450 fema…  2007
##  6 Adelie  Torge…       39.3      20.6               190        3650 male   2007
##  7 Adelie  Torge…       38.9      17.8               181        3625 fema…  2007
##  8 Adelie  Torge…       39.2      19.6               195        4675 male   2007
##  9 Adelie  Torge…       34.1      18.1               193        3475 <NA>   2007
## 10 Adelie  Torge…       42        20.2               190        4250 <NA>   2007
## # ℹ 334 more rows

Notice rename also preserves the order of the variables found in the original data.

7.3 Creating variables with mutate

We use mutate to add new variables to a data frame or tibble. This is useful if we need to construct one or more derived variables to support an analysis. Basic usage of mutate looks like this:

mutate(<data>, <expression-1>, <expression-2>, ...)

Again, this is not an example we can run—it’s pseudocode highlighting how to use mutate in abstract terms.

The first argument, <data>, must be the name of the object containing our data. We then include a series of one or more additional arguments, where each of these is a valid R expression involving one or more variables in <data>. We’ve have expressed these as <expression-1>, <expression-2>, where <expression-1> and <expression-2> represent the first two expressions, and the ... is acting as a placeholder for the remaining expressions. These can be any valid R code that refers to variables in <data>. This is often a simple calculation (e.g. involving arithmetic), but it can be arbitrarily complex.

To see mutate in action, let’s construct a new version of penguins that contains one extra variable—body mass measured in kilograms:

mutate(penguins, body_mass_g / 1000)
## # A tibble: 344 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 3 more variables: sex <chr>, year <int>, `body_mass_g/1000` <dbl>

This creates a copy of penguins with a new column called body_mass_g/1000 (look at the bottom of the printed output). That is not a very good name but do not worry—we will improve on it in a moment. Most of the rules that apply to select also apply to mutate:

  • Quotes must not be placed around an expression that performs a calculation.  This makes sense because the expression is meant to be evaluated so that it “does something”. It is not a value.
  • The mutate function does not have side effects, meaning it does not change the original penguins in any way. In the example, we printed the result produced by mutate rather than assigning it a name using <-, which means we have no way to access the result.
  • The mutate function returns the same kind of object as the one it is working on: a data frame if our data was originally in a data frame, a tibble if it was a tibble.

Creating a variable called something like body_mass_g/1000 is not ideal because that is a difficult name to work with. Fortunately, the mutate function can name new variables at the same time as it creates them. We just name the arguments using =, placing the name on the left-hand side. Look at how to use this construct to name the new area variable body_mass_kg:

mutate(penguins, body_mass_kg = body_mass_g / 1000)
## # A tibble: 344 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 3 more variables: sex <chr>, year <int>, body_mass_kg <dbl>

We can create more than one variable by supplying mutate multiple (named) arguments:

mutate(penguins, 
       bill_size = bill_depth_mm * bill_length_mm,
       scaled_bill_size = bill_size / body_mass_g)
## # A tibble: 344 × 10
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 4 more variables: sex <chr>, year <int>, bill_size <dbl>,
## #   scaled_bill_size <dbl>

Notice that we placed each calculation on a new line, remembering to use a comma to separate arguments. We can do this because R ignores white space. Splitting long a function call across multiple lines in this way is helpful because it makes it easier to read and understand the sequence of calculations.

This last example reveals a nice feature of mutate—we can use newly created variables in further calculations. Here we constructed a synthetic bill size variable, and used that to calculate a second variable representing the ratio of bill size to body mass.

7.3.1 Transforming and dropping variables

Occasionally we need to construct one or more new variables and then drop all the other ones in the original dataset. The transmute function is designed to do this. It works exactly like mutate, but it has a slightly different behaviour:

transmute(penguins, 
          bill_size = bill_depth_mm * bill_length_mm,
          scaled_bill_size = bill_size / body_mass_g)
## # A tibble: 344 × 2
##    bill_size scaled_bill_size
##        <dbl>            <dbl>
##  1      731.            0.195
##  2      687.            0.181
##  3      725.            0.223
##  4       NA            NA    
##  5      708.            0.205
##  6      810.            0.222
##  7      692.            0.191
##  8      768.            0.164
##  9      617.            0.178
## 10      848.            0.200
## # ℹ 334 more rows

Here we repeated the previous example, but now only the new variables were retained in the resulting tibble. If we also want to retain additional variables without altering them, we can pass them as unnamed arguments. For example, to retain species identity in the output, use:

transmute(penguins,
          species,
          bill_size = bill_depth_mm * bill_length_mm,
          scaled_bill_size = bill_size / body_mass_g)
## # A tibble: 344 × 3
##    species bill_size scaled_bill_size
##    <chr>       <dbl>            <dbl>
##  1 Adelie       731.            0.195
##  2 Adelie       687.            0.181
##  3 Adelie       725.            0.223
##  4 Adelie        NA            NA    
##  5 Adelie       708.            0.205
##  6 Adelie       810.            0.222
##  7 Adelie       692.            0.191
##  8 Adelie       768.            0.164
##  9 Adelie       617.            0.178
## 10 Adelie       848.            0.200
## # ℹ 334 more rows