Chapter 8 Working with observations

8.1 Introduction

This chapter will explore the filter and arrange verbs. We discuss these functions together because they manipulate observations (i.e. rows) of a data frame or tibble:

  • The filter function extracts a subset of observations based on supplied criteria.
  • The arrange function reorders the rows according to the values in one or more variables.

8.1.1 Getting ready

We’ll be using the dplyr package, so we need to remember to load and attach the package in the current session:

library("dplyr")

We’ll use the Palmer penguins data again to illustrate the ideas in this chapter. The examples below assume those data have been read into R as a tibble with the name penguins.

8.2 Relational and logical operators

Most filter operations rely on some combination of relational and logical operators. Relational operators allow us to ask questions like, “are the values of x greater than those of y?”, which we write as x > y. R uses these sorts of comparisons to express whether or not a particular condition is met, because they generate a logical vector of TRUE/FALSE values. Logical operators allow us to combine such conditions, thereby building up complex conditions from simpler ones.

This is best understood by example. We’ll do that in a moment. For now, simply make a mental note of the different relational and logical operators:

  1. Use relational operators to make comparisons between a pair of variables on the basis of conditions like ‘less than’ or ‘equal to’:

    • x < y: is x less than y?
    • x > y: is x greater than y?
    • x <= y: is x less than or equal to y?
    • x >= y: is x greater than or equal to y?
    • x == y: is x equal to y?
    • x != y: is x not equal to y?
  2. Use logical operators to connect two or more comparisons to arrive at a single overall criterion:

    • x & y: are both x AND y true?
    • x | y: is x OR y true?
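
As a minimal sketch of how these operators behave, the snippet below applies them to two short numeric vectors. The vectors x and y, and the names given to the results, are made up for illustration and have nothing to do with the penguins data.

```r
# Two small illustrative vectors (hypothetical values, not from penguins)
x <- c(1, 5, 3, 8)
y <- c(2, 5, 1, 9)

# Relational operators compare element by element and return a logical vector
x_less_than_y <- x < y    # TRUE FALSE FALSE TRUE
x_equal_to_y <- x == y    # FALSE TRUE FALSE FALSE

# Logical operators combine logical vectors, again element by element
both_true <- (x > 2) & (y > 2)      # FALSE TRUE FALSE TRUE
either_true <- (x > 6) | (y < 2)    # FALSE FALSE TRUE TRUE
```

Each result has one TRUE or FALSE per element, which is exactly the kind of vector filter needs to decide which rows to keep.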

Double == or single =?

Remember to use ‘double equals’ == when testing for equivalence between x and y. We all forget this from time to time and use ‘single equals’ = instead. This will lead to an error. dplyr is pretty good at spotting this mistake these days and will warn you in its error message that you used = when you meant to use ==. Of course, if you don’t read the error messages, you won’t benefit from this helpful behaviour.
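
The difference can be seen in a small sketch. The data frame toy and its values are hypothetical, and tryCatch is used here only so that the error can be captured and inspected rather than stopping the script.

```r
library("dplyr")

# A tiny illustrative data frame (hypothetical values)
toy <- data.frame(x = c(1, 2, 3))

# Correct: '==' tests for equality, so the matching row is returned
matched <- filter(toy, x == 2)

# Incorrect: 'x = 2' is read as a named argument, so filter() stops with an error.
# tryCatch() captures that error as an object instead of halting.
mistake <- tryCatch(filter(toy, x = 2), error = function(e) e)

inherits(mistake, "error")  # TRUE: the call with '=' failed
```

Printing mistake (or simply running the faulty call on its own) shows the full error message from dplyr.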

8.3 Subset observations with filter

We use filter to subset observations in a data frame or tibble containing our data. This is useful when we want to limit an analysis to a particular group of observations. Basic usage of filter looks something like this:

filter(<data>, <expression-1>, <expression-2>, ...)

Yes, this is pseudocode again. Let’s review the arguments:

  • The first argument, <data>, must be the name of the object (usually a data frame or tibble) containing our data. As with all dplyr verbs, this is not optional.
  • We then include one or more additional arguments. Each of these is a valid R expression involving one or more variables in <data> that returns a logical vector. We’ve expressed these as <expression-1>, <expression-2>, ..., where <expression-1> and <expression-2> represent the first two expressions, and the ... is acting as a placeholder for the remaining expressions.

To see filter in action, we’ll use it to subset observations in the penguins dataset, based on two relational criteria:

filter(penguins, bill_length_mm > 45, bill_depth_mm > 18)
## # A tibble: 44 × 8
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>     <chr>             <dbl>         <dbl>             <int>       <int>
##  1 Adelie    Torgers…           46            21.5               194        4200
##  2 Adelie    Torgers…           45.8          18.9               197        4150
##  3 Adelie    Biscoe             45.6          20.3               191        4600
##  4 Chinstrap Dream              50            19.5               196        3900
##  5 Chinstrap Dream              51.3          19.2               193        3650
##  6 Chinstrap Dream              45.4          18.7               188        3525
##  7 Chinstrap Dream              52.7          19.8               197        3725
##  8 Chinstrap Dream              46.1          18.2               178        3250
##  9 Chinstrap Dream              51.3          18.2               197        3750
## 10 Chinstrap Dream              46            18.9               195        4150
## # ℹ 34 more rows
## # ℹ 2 more variables: sex <chr>, year <int>

In this example, we’ve created a subset of penguins that only includes observations where the bill_length_mm variable is greater than 45 and the bill_depth_mm variable is greater than 18, i.e. both conditions must be met for an observation to be retained. This is probably starting to feel repetitious, but there are a few features of filter that we should be aware of:

  • We do not surround each expression with quotes. The expression is meant to be evaluated; it is not a value.
  • The result produced by filter was printed to the Console in the example. The filter function did not change the original penguins in any way (no side effects!).
  • The filter function will return the same kind of data object it is working on: it returns a data frame if our data was originally in a data frame, and a tibble if it was a tibble.
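
These points can be checked with a short sketch. It uses a small stand-in tibble, toy_penguins, with made-up values so that it runs without the real data; the name long_deep_bills is also just an illustrative choice.

```r
library("dplyr")

# A small stand-in for penguins (hypothetical values)
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Chinstrap", "Gentoo"),
  bill_length_mm = c(39.1, 46.0, 50.0, 47.5),
  bill_depth_mm = c(18.7, 21.5, 19.5, 15.0)
)

# Store the subset under a new name; filter() does not alter its input
long_deep_bills <- filter(toy_penguins, bill_length_mm > 45, bill_depth_mm > 18)

nrow(toy_penguins)       # 4: the original is unchanged
nrow(long_deep_bills)    # 2: only two rows meet both conditions
class(long_deep_bills)   # still a tibble, because the input was a tibble
```

If we want to keep a subset for later use, we have to assign it to a name, as done here.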

Notice that including two conditions separated by a comma means both conditions have to be met. There is another way to achieve the exact same result:

filter(penguins, bill_length_mm > 45 & bill_depth_mm > 18)
## # A tibble: 44 × 8
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>     <chr>             <dbl>         <dbl>             <int>       <int>
##  1 Adelie    Torgers…           46            21.5               194        4200
##  2 Adelie    Torgers…           45.8          18.9               197        4150
##  3 Adelie    Biscoe             45.6          20.3               191        4600
##  4 Chinstrap Dream              50            19.5               196        3900
##  5 Chinstrap Dream              51.3          19.2               193        3650
##  6 Chinstrap Dream              45.4          18.7               188        3525
##  7 Chinstrap Dream              52.7          19.8               197        3725
##  8 Chinstrap Dream              46.1          18.2               178        3250
##  9 Chinstrap Dream              51.3          18.2               197        3750
## 10 Chinstrap Dream              46            18.9               195        4150
## # ℹ 34 more rows
## # ℹ 2 more variables: sex <chr>, year <int>

This version links the two parts with the logical & operator. That is, rather than supplying bill_length_mm > 45 and bill_depth_mm > 18 as two arguments, we used a single R expression, combining them with the &.
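
One way to convince ourselves that the two forms are interchangeable is to compare their results directly. The sketch below does this on a small stand-in tibble, toy_penguins, whose values are invented for illustration.

```r
library("dplyr")

# A small stand-in for penguins (hypothetical values)
toy_penguins <- tibble(
  bill_length_mm = c(39.1, 46.0, 50.0, 47.5),
  bill_depth_mm = c(18.7, 21.5, 19.5, 15.0)
)

# Two conditions as separate arguments...
with_comma <- filter(toy_penguins, bill_length_mm > 45, bill_depth_mm > 18)

# ...and the same two conditions joined with & in a single expression
with_ampersand <- filter(toy_penguins, bill_length_mm > 45 & bill_depth_mm > 18)

identical(with_comma, with_ampersand)  # TRUE
```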

We’re pointing this out because we sometimes need to create filtering criteria that cannot be expressed as ‘condition 1’ and ‘condition 2’ and ‘condition 3’… etc. Under those conditions we have to use logical operators to connect conditions. A simple instance of this situation is where we need to subset on an either/or basis. For example:

filter(penguins, bill_length_mm < 36 | bill_length_mm > 54)
## # A tibble: 29 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           34.1          18.1               193        3475
##  2 Adelie  Torgersen           34.6          21.1               198        4400
##  3 Adelie  Torgersen           34.4          18.4               184        3325
##  4 Adelie  Biscoe              35.9          19.2               189        3800
##  5 Adelie  Biscoe              35.3          18.9               187        3800
##  6 Adelie  Biscoe              35            17.9               190        3450
##  7 Adelie  Biscoe              34.5          18.1               187        2900
##  8 Adelie  Biscoe              35.7          16.9               185        3150
##  9 Adelie  Biscoe              35.5          16.2               195        3350
## 10 Adelie  Torgersen           35.9          16.6               190        3050
## # ℹ 19 more rows
## # ℹ 2 more variables: sex <chr>, year <int>

This creates a subset of penguins that only includes observations where bill_length_mm is less than 36 or (|) greater than 54. In other words, it picks out the more ‘extreme’ values of bill length (unusually small or large).
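
The same either/or logic can be tried on a handful of made-up values. In this sketch the tibble toy_penguins and the name extreme_bills are hypothetical; note that no single row can satisfy both conditions at once, which is why separating them with a comma would not work here.

```r
library("dplyr")

# A small stand-in for penguins (hypothetical values)
toy_penguins <- tibble(bill_length_mm = c(34.1, 36.0, 45.2, 54.0, 55.9))

# Keep rows where EITHER condition holds
extreme_bills <- filter(toy_penguins, bill_length_mm < 36 | bill_length_mm > 54)
extreme_bills$bill_length_mm  # 34.1 55.9

# With a comma, both conditions would have to hold, so nothing is returned
no_rows <- filter(toy_penguins, bill_length_mm < 36, bill_length_mm > 54)
nrow(no_rows)  # 0
```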

We’re not limited to using relational and logical operators when working with filter. The conditions specified in the filter function can be any expression that returns a logical vector. The only constraint is that the resulting vector must either have one element for each row of the data, or be a single logical value (TRUE or FALSE).

Here’s an example. The dplyr between function is used to determine whether the values of a numeric vector fall in a specified range. It has three arguments: the numeric vector to filter on, followed by the lower and upper boundary values. For example:

filter(penguins, between(bill_length_mm, 36, 54))
## # A tibble: 313 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           42            20.2               190        4250
##  9 Adelie  Torgersen           37.8          17.1               186        3300
## 10 Adelie  Torgersen           37.8          17.3               180        3700
## # ℹ 303 more rows
## # ℹ 2 more variables: sex <chr>, year <int>
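
Two details of this example are worth checking on a small scale: between includes both boundary values, and filter drops any row where the condition evaluates to NA. The sketch below uses a made-up tibble, toy_penguins, that contains values sitting exactly on the boundaries along with one missing value.

```r
library("dplyr")

# A small stand-in for penguins (hypothetical values, including one NA)
toy_penguins <- tibble(bill_length_mm = c(34.1, 36.0, 45.2, 54.0, 55.9, NA))

# between() keeps values from 36 to 54, boundaries included
in_range <- filter(toy_penguins, between(bill_length_mm, 36, 54))
in_range$bill_length_mm  # 36.0 45.2 54.0

# The same subset written with relational and logical operators
same_thing <- filter(toy_penguins, bill_length_mm >= 36 & bill_length_mm <= 54)

identical(in_range, same_thing)  # TRUE
```

The row with the missing bill length appears in neither result, because a comparison involving NA is itself NA rather than TRUE.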

8.4 Reordering observations with arrange

We use arrange to reorder the rows of a data frame or tibble. Basic usage of arrange looks like this:

arrange(<data>, <variable-1>, <variable-2>, ...)

Yes, this is pseudocode. As always, the first argument, <data>, is the name of the object containing our data. We then include a series of one or more additional arguments, where each of these is the name of a variable in <data>: <variable-1> and <variable-2> are names of the first two ordering variables, and the ... is acting as a placeholder for the remaining variables.

To see arrange in action, let’s construct a new version of penguins where the rows have been reordered first by flipper_length_mm, and then by body_mass_g:

arrange(penguins, flipper_length_mm, body_mass_g)
## # A tibble: 344 × 8
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>     <chr>             <dbl>         <dbl>             <int>       <int>
##  1 Adelie    Biscoe             37.9          18.6               172        3150
##  2 Adelie    Biscoe             37.8          18.3               174        3400
##  3 Adelie    Torgers…           40.2          17                 176        3450
##  4 Adelie    Dream              33.1          16.1               178        2900
##  5 Adelie    Dream              39.5          16.7               178        3250
##  6 Chinstrap Dream              46.1          18.2               178        3250
##  7 Adelie    Dream              37.2          18.1               178        3900
##  8 Adelie    Dream              37.5          18.9               179        2975
##  9 Adelie    Dream              42.2          18.5               180        3550
## 10 Adelie    Biscoe             37.7          18.7               180        3600
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>

This creates a new version of penguins where the rows are sorted according to the values of flipper_length_mm and body_mass_g in ascending order – i.e. from smallest to largest. Look at the cases where flipper length is 178 mm. What do these show? Since flipper_length_mm was placed before body_mass_g in the arguments, the values of body_mass_g are only used to break ties within any particular value of flipper_length_mm.
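
The tie-breaking behaviour is easier to see with only a few rows. The following sketch uses a made-up tibble, toy_penguins, with a hypothetical id column added purely so that the resulting row order can be read off at a glance.

```r
library("dplyr")

# A small stand-in for penguins (hypothetical values; 'id' labels the rows)
toy_penguins <- tibble(
  id = c("a", "b", "c", "d"),
  flipper_length_mm = c(178L, 172L, 178L, 178L),
  body_mass_g = c(3900L, 3150L, 2900L, 3250L)
)

# Sort by flipper length first; body mass only orders rows with equal flipper length
sorted <- arrange(toy_penguins, flipper_length_mm, body_mass_g)
sorted$id  # "b" "c" "d" "a"
```

Row b comes first because it has the shortest flipper, even though its body mass is not the smallest. The three 178 mm rows are then ordered among themselves by body mass.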

For the sake of avoiding doubt about how arrange works, we will quickly review its behaviour. It works the same as every other dplyr verb we have looked at:

  • The variable names used as arguments of arrange are not surrounded by quotes.
  • The arrange function did not change the original penguins in any way.
  • The arrange function will return the same kind of data object it is working on.

arrange sorts variables in ascending order by default. If we need it to sort a variable in descending order, we wrap the variable name in the dplyr desc function:

arrange(penguins, flipper_length_mm, desc(body_mass_g))
## # A tibble: 344 × 8
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>     <chr>             <dbl>         <dbl>             <int>       <int>
##  1 Adelie    Biscoe             37.9          18.6               172        3150
##  2 Adelie    Biscoe             37.8          18.3               174        3400
##  3 Adelie    Torgers…           40.2          17                 176        3450
##  4 Adelie    Dream              37.2          18.1               178        3900
##  5 Adelie    Dream              39.5          16.7               178        3250
##  6 Chinstrap Dream              46.1          18.2               178        3250
##  7 Adelie    Dream              33.1          16.1               178        2900
##  8 Adelie    Dream              37.5          18.9               179        2975
##  9 Adelie    Biscoe             40.5          18.9               180        3950
## 10 Adelie    Biscoe             38.8          17.2               180        3800
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <chr>, year <int>

This creates a new version of penguins where the rows are sorted according to the values of flipper_length_mm and body_mass_g, in ascending and descending order, respectively. Look carefully at the values in the flipper_length_mm and body_mass_g columns to see the difference between this example and the previous one.
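
For a side-by-side comparison on a smaller scale, the sketch below sorts a made-up tibble both ways. The tibble toy_penguins and its id column are hypothetical and exist only to make the row order easy to follow.

```r
library("dplyr")

# A small stand-in for penguins (hypothetical values; 'id' labels the rows)
toy_penguins <- tibble(
  id = c("a", "b", "c", "d"),
  flipper_length_mm = c(178L, 172L, 178L, 178L),
  body_mass_g = c(3900L, 3150L, 2900L, 3250L)
)

# Ties in flipper length broken by ascending body mass
ascending <- arrange(toy_penguins, flipper_length_mm, body_mass_g)
ascending$id   # "b" "c" "d" "a"

# Ties in flipper length broken by descending body mass
descending <- arrange(toy_penguins, flipper_length_mm, desc(body_mass_g))
descending$id  # "b" "a" "d" "c"
```

In both cases row b stays at the top, because desc was applied only to body_mass_g and flipper_length_mm is still sorted in ascending order.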