Chapter 7 Working with variables
7.1 Introduction
This chapter will explore the dplyr select
and mutate
verbs, and the closely related rename
and transmute
verbs. We consider these functions together because they operate on the variables (i.e. the columns) of a data frame or tibble:
- The
select
function selects a subset of variables to retain and (optionally) renames them in the process. - The
mutate
function creates new variables from pre-existing ones and retains the original variables. - The
rename
function renames one or more variables while keeping the remaining variable names unchanged. - The
transmute
function creates new variables from pre-existing ones and drops the original variables.
7.1.1 Getting ready
Obviously, we need to have first installed dplyr package to use it. Assuming that’s been done, we need to load and attach the package in the current session:
library("dplyr")
We will use the Palmer penguins data to illustrate the ideas in this chapter. Remember—the previous chapter described this data set and explained where to find it. The examples below assume it was read into R as a tibble with the name penguins
.
7.2 Subset variables with select
We use select
to select variables from a data frame or tibble. This is used when we have a data set with many variables but only need to work with a subset of these. Basic usage of select
looks like this:
select(<data>, <variable-1>, <variable-2>, ...)
Remember—this is not an example we can run. This is a pseudocode designed to provide a generic description of how we use select
. Let’s look at the arguments of select:
- The first argument,
<data>
, must be the name of the object containing our data (usually a data frame or tibble). This is not optional—dplyr functions only exist to manipulate data. - We then include a series of one or more additional arguments, where each one is the name of a variable in
<data>
. We’ve expressed this as<variable-1>, <variable-2>, ...
, where<variable-1>
and<variable-2>
are names of the first two variables. The...
is acting as a placeholder for the remaining variables. There could be any number of these.
It’s easiest to understand how a function like select
works by seeing it in action. We select the species
, bill_length_mm
and bill_depth_mm
variables from penguins
like this:
select(penguins, species, bill_length_mm, bill_depth_mm)
## # A tibble: 344 × 3
## species bill_length_mm bill_depth_mm
## <chr> <dbl> <dbl>
## 1 Adelie 39.1 18.7
## 2 Adelie 39.5 17.4
## 3 Adelie 40.3 18
## 4 Adelie NA NA
## 5 Adelie 36.7 19.3
## 6 Adelie 39.3 20.6
## 7 Adelie 38.9 17.8
## 8 Adelie 39.2 19.6
## 9 Adelie 34.1 18.1
## 10 Adelie 42 20.2
## # ℹ 334 more rows
Hopefully, nothing about this example is too surprising. However, there are a few subtleties buried in that example:
- The
select
function is designed to work in a non-standard way which means variable names should not be surrounded by quotes. The one exception is when a name has a space in it. Under those circumstances, it has to be quoted with backticks, e.g.`variable 1`
. - The
select
function does not have ‘side effects’. This means is that it does not change the originalpenguins
object. We printed the result produced byselect
to the Console, so we can’t access the modified data set. If we need to use the result, we have to assign it a name using<-
. - The order of variables (i.e. the column order) in the resulting object is the same as the order in which they were supplied to the argument list. This means we can reorder variables at the same time as selecting them if we need to.
- The
select
function will return the same kind of data object we give it to work on. It returns a data frame if our data was in a data frame and a tibble if it was a tibble. In this example, R prints a tibble becausepenguins
was a tibble.
That second point is important—we have to remember to assign the result a name using <-
if we want to keep it and use it later. For example, we might call the result of that last example penguins_bill
:
<- select(penguins, species, bill_length_mm, bill_depth_mm) penguins_bill
Now that we’ve named the new data set created by select
we can refer to it by that name whenever we need it:
penguins_bill
## # A tibble: 344 × 3
## species bill_length_mm bill_depth_mm
## <chr> <dbl> <dbl>
## 1 Adelie 39.1 18.7
## 2 Adelie 39.5 17.4
## 3 Adelie 40.3 18
## 4 Adelie NA NA
## 5 Adelie 36.7 19.3
## 6 Adelie 39.3 20.6
## 7 Adelie 38.9 17.8
## 8 Adelie 39.2 19.6
## 9 Adelie 34.1 18.1
## 10 Adelie 42 20.2
## # ℹ 334 more rows
Remember—the original penguins
data is completely unchanged:
penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
7.2.1 Alternative ways to identify variables with select
It’s sometimes more convenient to use select
to subset variables by specifying those we do not need, rather than specifying of the ones to keep. We can use the !
operator to indicate that certain variables should be dropped. For example, to get rid of the bill_depth_mm
and bill_length_mm
columns, we could use:
select(penguins, !bill_depth_mm, !bill_length_mm)
## # A tibble: 344 × 8
## species island bill_length_mm flipper_length_mm body_mass_g sex year
## <chr> <chr> <dbl> <int> <int> <chr> <int>
## 1 Adelie Torgersen 39.1 181 3750 male 2007
## 2 Adelie Torgersen 39.5 186 3800 female 2007
## 3 Adelie Torgersen 40.3 195 3250 female 2007
## 4 Adelie Torgersen NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 193 3450 female 2007
## 6 Adelie Torgersen 39.3 190 3650 male 2007
## 7 Adelie Torgersen 38.9 181 3625 female 2007
## 8 Adelie Torgersen 39.2 195 4675 male 2007
## 9 Adelie Torgersen 34.1 193 3475 <NA> 2007
## 10 Adelie Torgersen 42 190 4250 <NA> 2007
## # ℹ 334 more rows
## # ℹ 1 more variable: bill_depth_mm <dbl>
This returns a tibble with all the other variables: species
, island
, flipper_length_mm
, body_mass_g
, sex
and year
.
The select
function can also be used to grab (or drop) a set of variables that occur in a sequence next to one another. We specify a series of adjacent variables using the :
operator. We use this with two variable names, one on the left-hand side and one on the right. When we use :
like this, select
will subset both those two variables along with any others that fall in between them.
For example, if we want only the morphometric variables (bill_length_mm
, bill_depth_mm
, flipper_length_mm
and body_mass_g
) we could use:
select(penguins, bill_length_mm:body_mass_g)
## # A tibble: 344 × 4
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <dbl> <dbl> <int> <int>
## 1 39.1 18.7 181 3750
## 2 39.5 17.4 186 3800
## 3 40.3 18 195 3250
## 4 NA NA NA NA
## 5 36.7 19.3 193 3450
## 6 39.3 20.6 190 3650
## 7 38.9 17.8 181 3625
## 8 39.2 19.6 195 4675
## 9 34.1 18.1 193 3475
## 10 42 20.2 190 4250
## # ℹ 334 more rows
The :
operator can also be combined with !
if we need to drop a series of variables according to their position in a data frame or tibble. For example, we can use this trick to get the complement of the previous example, i.e. throw away the morphometric variables:
select(penguins, !bill_length_mm:body_mass_g)
## # A tibble: 344 × 4
## species island sex year
## <chr> <chr> <chr> <int>
## 1 Adelie Torgersen male 2007
## 2 Adelie Torgersen female 2007
## 3 Adelie Torgersen female 2007
## 4 Adelie Torgersen <NA> 2007
## 5 Adelie Torgersen female 2007
## 6 Adelie Torgersen male 2007
## 7 Adelie Torgersen female 2007
## 8 Adelie Torgersen male 2007
## 9 Adelie Torgersen <NA> 2007
## 10 Adelie Torgersen <NA> 2007
## # ℹ 334 more rows
7.2.2 Renaming variables with select
and rename
The select
function can also rename variables at the same time as selecting them. To do this, we name the arguments using the name = value
construct, where the name of the selected variable is placed on the right-hand side (value
), and the new name goes on the left-hand side (name
).
For example, to select thespecies
, bill_length_mm
and bill_depth_mm
variables from penguins
, and in the process, rename bill_length_mm
and bill_depth_mm
to BillLength
and BillDepth
, use:
select(penguins, species, BillLength = bill_length_mm, BillDepth = bill_depth_mm)
## # A tibble: 344 × 3
## species BillLength BillDepth
## <chr> <dbl> <dbl>
## 1 Adelie 39.1 18.7
## 2 Adelie 39.5 17.4
## 3 Adelie 40.3 18
## 4 Adelie NA NA
## 5 Adelie 36.7 19.3
## 6 Adelie 39.3 20.6
## 7 Adelie 38.9 17.8
## 8 Adelie 39.2 19.6
## 9 Adelie 34.1 18.1
## 10 Adelie 42 20.2
## # ℹ 334 more rows
Renaming the variables is a common task. What should we do if the only thing we want to achieve is to rename variables, rather than rename and select them? dplyr provides an additional function called rename
for exactly this purpose. This function renames some variables while retaining all others. It works like select
. For example, to rename bill_length_mm
and bill_depth_mm
to BillLength
and BillDepth
but keep all the variables, use:
rename(penguins, BillLength = bill_length_mm, BillDepth = bill_depth_mm)
## # A tibble: 344 × 8
## species island BillLength BillDepth flipper_length_mm body_mass_g sex year
## <chr> <chr> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Adelie Torge… 39.1 18.7 181 3750 male 2007
## 2 Adelie Torge… 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torge… 40.3 18 195 3250 fema… 2007
## 4 Adelie Torge… NA NA NA NA <NA> 2007
## 5 Adelie Torge… 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torge… 39.3 20.6 190 3650 male 2007
## 7 Adelie Torge… 38.9 17.8 181 3625 fema… 2007
## 8 Adelie Torge… 39.2 19.6 195 4675 male 2007
## 9 Adelie Torge… 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie Torge… 42 20.2 190 4250 <NA> 2007
## # ℹ 334 more rows
Notice rename
also preserves the order of the variables found in the original data.
7.3 Creating variables with mutate
We use mutate
to add new variables to a data frame or tibble. This is useful if we need to construct one or more derived variables to support an analysis. Basic usage of mutate
looks like this:
mutate(<data>, <expression-1>, <expression-2>, ...)
Again, this is not an example we can run—it’s pseudocode highlighting how to use mutate
in abstract terms.
The first argument, <data>
, must be the name of the object containing our data. We then include a series of one or more additional arguments, where each of these is a valid R expression involving one or more variables in <data>
. We’ve have expressed these as <expression-1>, <expression-2>
, where <expression-1>
and <expression-2>
represent the first two expressions, and the ...
is acting as a placeholder for the remaining expressions. These can be any valid R code that refers to variables in <data>
. This is often a simple calculation (e.g. involving arithmetic), but it can be arbitrarily complex.
To see mutate
in action, let’s construct a new version of penguins
that contains one extra variable—body mass measured in kilograms:
mutate(penguins, body_mass_g / 1000)
## # A tibble: 344 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 3 more variables: sex <chr>, year <int>, `body_mass_g/1000` <dbl>
This creates a copy of penguins
with a new column called body_mass_g/1000
(look at the bottom of the printed output). That is not a very good name but do not worry—we will improve on it in a moment. Most of the rules that apply to select
also apply to mutate
:
- Quotes must not be placed around an expression that performs a calculation. This makes sense because the expression is meant to be evaluated so that it “does something”. It is not a value.
- The
mutate
function does not have side effects, meaning it does not change the originalpenguins
in any way. In the example, we printed the result produced bymutate
rather than assigning it a name using<-
, which means we have no way to access the result. - The
mutate
function returns the same kind of object as the one it is working on: a data frame if our data was originally in a data frame, a tibble if it was a tibble.
Creating a variable called something like body_mass_g/1000
is not ideal because that is a difficult name to work with. Fortunately, the mutate
function can name new variables at the same time as it creates them. We just name the arguments using =
, placing the name on the left-hand side. Look at how to use this construct to name the new area variable body_mass_kg
:
mutate(penguins, body_mass_kg = body_mass_g / 1000)
## # A tibble: 344 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 3 more variables: sex <chr>, year <int>, body_mass_kg <dbl>
We can create more than one variable by supplying mutate
multiple (named) arguments:
mutate(penguins,
bill_size = bill_depth_mm * bill_length_mm,
scaled_bill_size = bill_size / body_mass_g)
## # A tibble: 344 × 10
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 4 more variables: sex <chr>, year <int>, bill_size <dbl>,
## # scaled_bill_size <dbl>
Notice that we placed each calculation on a new line, remembering to use a comma to separate arguments. We can do this because R ignores white space. Splitting long a function call across multiple lines in this way is helpful because it makes it easier to read and understand the sequence of calculations.
This last example reveals a nice feature of mutate
—we can use newly created variables in further calculations. Here we constructed a synthetic bill size variable, and used that to calculate a second variable representing the ratio of bill size to body mass.
7.3.1 Transforming and dropping variables
Occasionally we need to construct one or more new variables and then drop all the other ones in the original dataset. The transmute
function is designed to do this. It works exactly like mutate
, but it has a slightly different behaviour:
transmute(penguins,
bill_size = bill_depth_mm * bill_length_mm,
scaled_bill_size = bill_size / body_mass_g)
## # A tibble: 344 × 2
## bill_size scaled_bill_size
## <dbl> <dbl>
## 1 731. 0.195
## 2 687. 0.181
## 3 725. 0.223
## 4 NA NA
## 5 708. 0.205
## 6 810. 0.222
## 7 692. 0.191
## 8 768. 0.164
## 9 617. 0.178
## 10 848. 0.200
## # ℹ 334 more rows
Here we repeated the previous example, but now only the new variables were retained in the resulting tibble. If we also want to retain additional variables without altering them, we can pass them as unnamed arguments. For example, to retain species
identity in the output, use:
transmute(penguins,
species,bill_size = bill_depth_mm * bill_length_mm,
scaled_bill_size = bill_size / body_mass_g)
## # A tibble: 344 × 3
## species bill_size scaled_bill_size
## <chr> <dbl> <dbl>
## 1 Adelie 731. 0.195
## 2 Adelie 687. 0.181
## 3 Adelie 725. 0.223
## 4 Adelie NA NA
## 5 Adelie 708. 0.205
## 6 Adelie 810. 0.222
## 7 Adelie 692. 0.191
## 8 Adelie 768. 0.164
## 9 Adelie 617. 0.178
## 10 Adelie 848. 0.200
## # ℹ 334 more rows