Chapter 8 Working with observations
8.1 Introduction
This chapter will explore the filter and arrange verbs. We discuss these functions together because they manipulate observations (i.e. rows) of a data frame or tibble:
- The filter function extracts a subset of observations based on supplied criteria.
- The arrange function reorders the rows according to the values in one or more variables.
8.1.1 Getting ready
We’ll be using the dplyr package, so we need to remember to load and attach the package in the current session:
library("dplyr")
We’ll use the Palmer penguins data again to illustrate the ideas in this chapter. The examples below assume those data have been read into R as a tibble with the name penguins.
8.2 Relational and logical operators
Most filter operations rely on some combination of relational and logical operators. Relational operators allow us to ask questions like, “are the values of x greater than those of y?”, which we write as x > y. R uses these comparisons to express whether or not a particular condition is met: each one generates a logical vector of TRUE/FALSE values. Logical operators allow us to combine such conditions, thereby building up complex conditions from simpler ones.
This is best understood by example. We’ll do that in a moment. For now, simply make a mental note of the different relational and logical operators:
Use relational operators to make comparisons between a pair of variables on the basis of conditions like ‘less than’ or ‘equal to’:
- x < y: is x less than y?
- x > y: is x greater than y?
- x <= y: is x less than or equal to y?
- x >= y: is x greater than or equal to y?
- x == y: is x equal to y?
- x != y: is x not equal to y?
Use logical operators to connect two or more comparisons to arrive at a single overall criterion:
- x & y: are both x AND y true?
- x | y: is x OR y true?
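To make the operators above concrete, here is a minimal sketch using two short vectors. The vectors x and y and their values are made up purely for illustration; only base R is needed.

```r
# Two made-up numeric vectors of equal length (illustrative values only)
x <- c(1, 5, 3, 8)
y <- c(2, 5, 1, 9)

# Relational operators compare element by element and return a logical vector
less_than <- x < y    # TRUE FALSE FALSE TRUE
equal_to  <- x == y   # FALSE TRUE FALSE FALSE
not_equal <- x != y   # TRUE FALSE TRUE TRUE

# Logical operators combine logical vectors, again element by element
both   <- (x < y) & (x > 2)   # FALSE FALSE FALSE TRUE
either <- (x < y) | (x == y)  # TRUE TRUE FALSE TRUE

print(both)
print(either)
```

Each result has one TRUE/FALSE value per element, which is exactly the kind of vector that filter uses to decide which rows to keep.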
Double == or single =?
Remember to use ‘double equals’ == when testing for equivalence between x and y. We all forget this from time to time and use ‘single equals’ = instead. This will lead to an error. dplyr is pretty good at spotting this mistake these days and will warn you in its error message that you used = when you meant to use ==. Of course, if you don’t read the error messages, you won’t benefit from this helpful behaviour.
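The difference between == and = can be seen in a small sketch. The tibble toy and its values are invented for illustration (they are not the real penguins data), and tryCatch is used only so that the script keeps running after the deliberate mistake. The exact wording of the error message depends on the installed dplyr version, so it is printed rather than quoted here.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(
  species = c("Adelie", "Gentoo", "Adelie"),
  mass_g  = c(3700, 5000, 3900)
)

# Correct: '==' tests each row for equality
adelie <- filter(toy, species == "Adelie")
print(adelie)

# Mistake: '=' is read as a named argument, so dplyr stops with an error.
# tryCatch() captures the error object instead of halting the script.
oops <- tryCatch(
  filter(toy, species = "Adelie"),
  error = function(e) e
)

print(inherits(oops, "error"))   # did we get an error?
cat(conditionMessage(oops), "\n") # read what dplyr has to say
```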
8.3 Subset observations with filter
We use filter to subset observations in a data frame or tibble containing our data. This is useful when we want to limit an analysis to a particular group of observations. Basic usage of filter looks something like this:
filter(<data>, <expression-1>, <expression-2>, ...)
Yes, this is pseudocode again. Let’s review the arguments:
- The first argument, <data>, must be the name of the object (usually a data frame or tibble) containing our data. As with all dplyr verbs, this is not optional.
- We then include one or more additional arguments. Each of these is a valid R expression involving one or more variables in <data> that returns a logical vector. We’ve expressed these as <expression-1>, <expression-2>, ..., where <expression-1> and <expression-2> represent the first two expressions, and the ... is acting as a placeholder for the remaining expressions.
To see filter in action, we’ll use it to subset observations in the penguins dataset, based on two relational criteria:
filter(penguins, bill_length_mm > 45, bill_depth_mm > 18)
## # A tibble: 44 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgers… 46 21.5 194 4200
## 2 Adelie Torgers… 45.8 18.9 197 4150
## 3 Adelie Biscoe 45.6 20.3 191 4600
## 4 Chinstrap Dream 50 19.5 196 3900
## 5 Chinstrap Dream 51.3 19.2 193 3650
## 6 Chinstrap Dream 45.4 18.7 188 3525
## 7 Chinstrap Dream 52.7 19.8 197 3725
## 8 Chinstrap Dream 46.1 18.2 178 3250
## 9 Chinstrap Dream 51.3 18.2 197 3750
## 10 Chinstrap Dream 46 18.9 195 4150
## # ℹ 34 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
In this example, we’ve created a subset of penguins that only includes observations where the bill_length_mm variable is greater than 45 and the bill_depth_mm variable is greater than 18, i.e. both conditions must be met for an observation to be retained. This is probably starting to feel repetitious, but there are a few features of filter that we should be aware of:
- We do not surround each expression with quotes. The expression is meant to be evaluated; it is not a value.
- The result produced by filter was printed to the Console in the example. The filter function did not change the original penguins in any way (no side effects!).
- The filter function will return the same kind of data object it is working on: it returns a data frame if our data was originally in a data frame, and a tibble if it was a tibble.
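The last two points in the list above can be checked directly. The following is a small sketch using invented data (the objects toy_df and toy_tbl and their values are assumptions for illustration, not the penguins data):

```r
library(dplyr)

# Made-up values, stored once as a data frame and once as a tibble
toy_df  <- data.frame(id = 1:4, mass_g = c(3200, 4100, 3900, 5000))
toy_tbl <- as_tibble(toy_df)

# To keep a filtered result we must assign it to a name
heavy_df  <- filter(toy_df,  mass_g > 3800)
heavy_tbl <- filter(toy_tbl, mass_g > 3800)

# The originals still have all four rows (no side effects)
print(nrow(toy_df))
print(nrow(toy_tbl))

# The result has the same kind of structure as the input
print(class(heavy_df))               # a plain data frame in, a data frame out
print(inherits(heavy_tbl, "tbl_df")) # a tibble in, a tibble out
```

If we want to use the subset later, we have to assign it to a name as done here; simply calling filter only prints the result.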
Notice that including two conditions separated by a comma means both conditions have to be met. There is another way to achieve exactly the same result:
filter(penguins, bill_length_mm > 45 & bill_depth_mm > 18)
## # A tibble: 44 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgers… 46 21.5 194 4200
## 2 Adelie Torgers… 45.8 18.9 197 4150
## 3 Adelie Biscoe 45.6 20.3 191 4600
## 4 Chinstrap Dream 50 19.5 196 3900
## 5 Chinstrap Dream 51.3 19.2 193 3650
## 6 Chinstrap Dream 45.4 18.7 188 3525
## 7 Chinstrap Dream 52.7 19.8 197 3725
## 8 Chinstrap Dream 46.1 18.2 178 3250
## 9 Chinstrap Dream 51.3 18.2 197 3750
## 10 Chinstrap Dream 46 18.9 195 4150
## # ℹ 34 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
This version links the two parts with the logical & operator. That is, rather than supplying bill_length_mm > 45 and bill_depth_mm > 18 as two arguments, we used a single R expression, combining them with the &.
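We can confirm that the comma form and the & form agree on a small example. This is a sketch with made-up measurements; the tibble toy reuses the penguins column names only for readability.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(
  bill_length_mm = c(44.0, 46.5, 50.2, 47.1),
  bill_depth_mm  = c(19.0, 17.5, 18.6, 20.1)
)

# Two conditions as separate arguments ...
by_comma <- filter(toy, bill_length_mm > 45, bill_depth_mm > 18)

# ... and the same two conditions joined with '&'
by_and <- filter(toy, bill_length_mm > 45 & bill_depth_mm > 18)

print(by_comma)
print(identical(by_comma, by_and))  # both approaches keep the same rows
```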
We’re pointing this out because we sometimes need to create filtering criteria that cannot be expressed as ‘condition 1’ and ‘condition 2’ and ‘condition 3’… etc. In those situations we have to use logical operators to connect the conditions. A simple instance is where we need to subset on an either/or basis. For example:
filter(penguins, bill_length_mm < 36 | bill_length_mm > 54)
## # A tibble: 29 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 34.1 18.1 193 3475
## 2 Adelie Torgersen 34.6 21.1 198 4400
## 3 Adelie Torgersen 34.4 18.4 184 3325
## 4 Adelie Biscoe 35.9 19.2 189 3800
## 5 Adelie Biscoe 35.3 18.9 187 3800
## 6 Adelie Biscoe 35 17.9 190 3450
## 7 Adelie Biscoe 34.5 18.1 187 2900
## 8 Adelie Biscoe 35.7 16.9 185 3150
## 9 Adelie Biscoe 35.5 16.2 195 3350
## 10 Adelie Torgersen 35.9 16.6 190 3050
## # ℹ 19 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
This creates a subset of penguins that only includes observations where bill_length_mm is less than 36 or (|) greater than 54. In other words, it keeps the observations with the more ‘extreme’ values of bill length (unusually small or large).
We’re not limited to using relational and logical operators when working with filter. The conditions specified in the filter function can be any expression that returns a logical vector. The only constraint is that the output vector must either be the same length as its input (one value per row) or be a single logical value (TRUE or FALSE).
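As a sketch of what ‘any expression that returns a logical vector’ can look like, the example below uses %in% and is.na from base R, and also shows a single logical value being applied to every row. The tibble toy and its values are invented for illustration.

```r
library(dplyr)

# Toy data with made-up values (not the real penguins data)
toy <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap", "Gentoo"),
  mass_g  = c(3700, NA, 3800, 5200)
)

# %in% returns one TRUE/FALSE per row: is the species in the given set?
small_species <- filter(toy, species %in% c("Adelie", "Chinstrap"))

# is.na() also returns a logical vector; '!' negates it
has_mass <- filter(toy, !is.na(mass_g))

# A comparison involving NA yields NA, and filter drops those rows
heavy <- filter(toy, mass_g > 3750)

# A single logical value is recycled to every row
everything <- filter(toy, TRUE)
nothing    <- filter(toy, FALSE)

print(small_species)
print(heavy)
```

Note the third case: filter only keeps rows where the condition is TRUE, so rows where it evaluates to NA are removed along with those where it is FALSE.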
Here’s an example. The dplyr between function is used to determine whether the values of a numeric vector fall in a specified range. It has three arguments: the numeric vector to filter on, and the lower and upper boundary values. Both boundaries are included in the range. For example:
filter(penguins, between(bill_length_mm, 36, 54))
## # A tibble: 313 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 42 20.2 190 4250
## 9 Adelie Torgersen 37.8 17.1 186 3300
## 10 Adelie Torgersen 37.8 17.3 180 3700
## # ℹ 303 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
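To see what between is doing, it can help to apply it to a plain vector and compare it with the equivalent pair of relational comparisons. The vector x below is made up for illustration, with values chosen to sit on and just outside the boundaries.

```r
library(dplyr)

# Made-up values on and around the boundaries 36 and 54
x <- c(35.9, 36, 45, 54, 54.1)

# between() includes both boundary values
in_range <- between(x, 36, 54)

# The same test written with relational and logical operators
same_test <- x >= 36 & x <= 54

print(in_range)
print(identical(in_range, same_test))
```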
8.4 Reordering observations with arrange
We use arrange to reorder the rows of a data frame or tibble. Basic usage of arrange looks like this:
arrange(<data>, <variable-1>, <variable-2>, ...)
Yes, this is pseudocode. As always, the first argument, <data>, is the name of the object containing our data. We then include a series of one or more additional arguments, where each of these is the name of a variable in <data>: <variable-1> and <variable-2> are names of the first two ordering variables, and the ... is acting as a placeholder for the remaining variables.
To see arrange in action, let’s construct a new version of penguins where the rows have been reordered first by flipper_length_mm, and then by body_mass_g:
arrange(penguins, flipper_length_mm, body_mass_g)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Biscoe 37.9 18.6 172 3150
## 2 Adelie Biscoe 37.8 18.3 174 3400
## 3 Adelie Torgers… 40.2 17 176 3450
## 4 Adelie Dream 33.1 16.1 178 2900
## 5 Adelie Dream 39.5 16.7 178 3250
## 6 Chinstrap Dream 46.1 18.2 178 3250
## 7 Adelie Dream 37.2 18.1 178 3900
## 8 Adelie Dream 37.5 18.9 179 2975
## 9 Adelie Dream 42.2 18.5 180 3550
## 10 Adelie Biscoe 37.7 18.7 180 3600
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
This creates a new version of penguins where the rows are sorted according to the values of flipper_length_mm and body_mass_g in ascending order, i.e. from smallest to largest. Look at the cases where flipper length is 178 mm. What do these show? Since flipper_length_mm was placed before body_mass_g in the arguments, the values of body_mass_g are only used to break ties within any particular value of flipper_length_mm.
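The tie-breaking behaviour is easier to see with only a handful of rows. The following sketch uses a made-up tibble (toy, with invented flipper_mm and mass_g values) in which two rows share the same flipper length:

```r
library(dplyr)

# Toy data with made-up values; two rows tie on flipper_mm = 178
toy <- tibble(
  flipper_mm = c(180, 178, 178, 172),
  mass_g     = c(3600, 3900, 2900, 3150)
)

# Sort by flipper_mm first; mass_g only decides the order within ties
sorted <- arrange(toy, flipper_mm, mass_g)

print(sorted$flipper_mm)  # 172 178 178 180
print(sorted$mass_g)      # 3150 2900 3900 3600
```

The mass_g column is not sorted overall; it is only in ascending order within the two rows where flipper_mm is 178.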
For the sake of avoiding doubt about how arrange works, we will quickly review its behaviour. It works the same as every other dplyr verb we have looked at:
- The variable names used as arguments of arrange are not surrounded by quotes.
- The arrange function did not change the original penguins in any way.
- The arrange function will return the same kind of data object it is working on.
arrange sorts variables in ascending order by default. If we need it to sort a variable in descending order, we wrap the variable name in the dplyr desc function:
arrange(penguins, flipper_length_mm, desc(body_mass_g))
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Adelie Biscoe 37.9 18.6 172 3150
## 2 Adelie Biscoe 37.8 18.3 174 3400
## 3 Adelie Torgers… 40.2 17 176 3450
## 4 Adelie Dream 37.2 18.1 178 3900
## 5 Adelie Dream 39.5 16.7 178 3250
## 6 Chinstrap Dream 46.1 18.2 178 3250
## 7 Adelie Dream 33.1 16.1 178 2900
## 8 Adelie Dream 37.5 18.9 179 2975
## 9 Adelie Biscoe 40.5 18.9 180 3950
## 10 Adelie Biscoe 38.8 17.2 180 3800
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
This creates a new version of penguins where the rows are sorted according to the values of flipper_length_mm and body_mass_g, in ascending and descending order, respectively. Look carefully at the values in the flipper_length_mm and body_mass_g columns to see the difference between this example and the previous one.
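A small sketch makes the effect of desc easy to check by eye. The tibble toy and its values are made up for illustration; two rows tie on flipper_mm so that the direction of the tie-break is visible.

```r
library(dplyr)

# Toy data with made-up values; two rows tie on flipper_mm = 178
toy <- tibble(
  flipper_mm = c(180, 178, 178, 172),
  mass_g     = c(3600, 3900, 2900, 3150)
)

# Ascending flipper_mm, with ties broken by descending mass_g
mixed <- arrange(toy, flipper_mm, desc(mass_g))
print(mixed$flipper_mm)  # 172 178 178 180
print(mixed$mass_g)      # 3150 3900 2900 3600

# desc() can wrap the first ordering variable too
largest_first <- arrange(toy, desc(flipper_mm), mass_g)
print(largest_first$flipper_mm)  # 180 178 178 172
print(largest_first$mass_g)      # 3600 2900 3900 3150
```

Each variable carries its own sort direction, so wrapping one variable in desc has no effect on how the others are ordered.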