Chapter 5 Packages

5.1 The R package system

The R package system is the most important single factor driving increased adoption of R. Packages are used to extend the basic capabilities of R. In his book about R packages Hadley Wickam says,

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.

An R package is a collection of folders and files in a standard, well-defined format that bundles together computer code, data, and documentation in a way that is easy to use and share with other users. The computer code might all be R code, but it can also include code written in other languages. Packages provide an R-friendly interface to use this “foreign” code without needing to understand how it works.

The base R distribution it comes with quite a few pre-installed packages. These base R packages represent a tiny subset of all available R packages. The majority of these are hosted on a worldwide network of web servers collectively know as CRAN: the Comprehensive R Archive Network, pronounced either “see-ran” or “kran”.

CRAN is a fairly spartan web site, so it’s easy to navigate. The landing page has about a dozen links on the right-hand side. Under the Software section there is a link called Packages. Near the top of that packages page there is a link called Table of available packages, sorted by name that points to a very long list of all the packages on CRAN. The column on the left shows each package name, followed by a brief description of what the package does on the right. There are 1000s of packages listed there.

5.2 Task views

The huge list of packages on the available packages is pretty overwhelming. A more user-friendly view of many R packages is found on the Task Views page (the link is on the left hand side, under the section labelled CRAN). A Task View is basically a curated guide to the packages and functions that are useful for certain disciplines. The Task Views page shows a list of these discipline-specific topics, along with a brief description. For example—

  • The Environmentrics Task View contains information about using R to analyse ecological and environmental data.
  • The Clinical Trials Task View contains information about using R to design, monitor, and analyse data from clinical trials.
  • The Medical Image Analysis Task View contains information about packages for working with commercial medical image data.
  • The Pharmacokinetic Task View contains information about packages for working with pharmacokinetic (PK) data.

Task views are often a good place to start looking for a new package to support a particular analysis in future projects.

5.3 Using packages

Two things need to happen to make use of a package. First, we need to copy the folders and files that make up the package to an appropriate location on our computer. This process is called installing the package. Second, we need to load and attach the package for use in an R session. The word “session” refers to the time between when we start up R and close it down again.

It’s worth unpacking these two ideas because packages are a very frequent source of confusion for new users:

  • If we don’t have a copy of a package’s folders and files our computer, we can’t use it. The process of making this copy is called installing the package. It’s possible to manually install packages by going to the CRAN website, downloading the package, and then using various tools to install it. We don’t recommend using this approach though, because it’s inefficient and error-prone. Instead, use built-in R functions to grab the package from CRAN and install it one step.

  • Once we have a copy of the package on our computer, it will remain there for us to use. We don’t need to re-install a package we want to use every time we start a new R session. It is worth saying that again, there is no need to install a package every time we start up R / RStudio. The only exception to this rule is that a major update to R will sometimes require a complete re-install of the packages. Such major updates are infrequent.

  • Installing a package does nothing more than place a copy of the relevant files on our hard drive. If we want to use the functions or the data that comes with a package we need to make them available in our current R session. Unlike package installation, this load and attach process as it’s known has to be repeated every time we restart R. If we forget to load up the package we can’t use it.

5.3.1 Viewing installed packages

We sometimes need to check whether a package is currently installed. RStudio provides a simple, intuitive way to see which packages are installed on our computer. The Packages tab in the bottom right pane shows the name of every installed package, a brief description and a version number.

There are also a few R functions that can be used to check whether a package is currently installed. For example, the find.package function can do this:

find.package("MASS")
## [1] "/Library/Frameworks/R.framework/Versions/4.2/Resources/library/MASS"

The find.package function either prints a “file path” showing us where the package is located, as above, or return an error if the package can’t be found. Alternatively, a function called installed.packages will return a data frame containing a lot of information about the installed packages.

5.3.2 Installing packages

R packages can be installed from several different sources. For example, they can be installed from a local file on a computer, from the CRAN repository, or from an other kind of online repository called Github. Although various alternatives to CRAN are becoming more popular, we’re only going to worry about installing packages that live on CRAN.

To install a package from an online repository like CRAN we have to download the package files, uncompress them (like we would a ZIP file), and move them to the correct location. All of this can be done using a single function: install.packages. For example, to install a package called fortunes we use:

install.packages("fortunes")

The quotes are necessary by the way. If everything is working—we have an active internet connection, the package name is valid, and so on—R will briefly pause while it communicates with the CRAN servers, we should see some red text reporting back what’s happening, and then we’re returned to the prompt. The red text is just letting us know what R is up to. As long as this text does not include the word “error”, there is usually no need to worry about it.

There are a couple of things to keep in mind. First, package names are case sensitive. For example, fortunes is not the same as Fortunes. Quite often package installations fail because we used the wrong case somewhere in the package name. The other aspect of packages we need to know about is related to dependencies: some packages rely on other packages in order to work properly. By default install.packages will install these dependencies, so we don’t usually have to worry too much about them. Just don’t be surprised if the install.packages function installs more than one package when only one was requested.

RStudio provides a way of interacting with install.packages via point-and-click. The Packages tab has an “Install”” button at the top right. Clicking on this brings up a small window with three main fields: “Install from”, “Packages”, and “Install to Library”. We only need to work with the “Packages” field – the other two can be left at their defaults. When we start typing in the first few letters of a package name (e.g. dplyr) RStudio provides a list of available packages that match this. After we select the one we want and click the “Install” button, RStudio invokes install.packages for us.

5.3.3 Loading and attaching packages

Once we’ve installed a package or two, we’ll probably want to use them. Two things have to happen to access a package’s facilities: the package has to be loaded into memory, and then it has to attached to something called a search path so that R can find it.

It is beyond the scope of this book to get in to “how” and “why” of these events. Fortunately, there’s no need to worry about these details, as both loading and attaching can be done in a single step with a function called library. The library function works as we might expect it to. If we want to start using the fortunes package—which was just installed above—all we need is:

library("fortunes")

Nothing much happens if everything is working as it should. R just returns us to the prompt without printing anything to the Console. The difference is that now we can use the functions that fortunes provides. As it turns out, there is only one function, called fortune:

fortune()
## 
## Friends don't let friends use Excel for statistics!
##    -- Jonathan D. Cryer (about problems with using Microsoft Excel for
##       statistics)
##       JSM 2001, Atlanta (August 2001)

The fortunes package is either very useful or utterly pointless, depending on one’s perspective. It dispenses quotes from various R experts delivered to the venerable R mailing list.

5.3.4 Don’t use RStudio for loading packages!

As usual, if we don’t like working in the Console RStudio can help us out. There is a small button next to each package listed in the Packages tab. Packages that have been loaded and attached have a blue check box next to them, whereas this is absent from those that have not. Clicking on an empty check box will load up the package. We mention this because at some point most people realise they can use RStudio to load and attach packages. We don’t recommend using this route. It’s much better to put library statements into your R script. Read the relevant appendix if you’re not sure what a script is yet.

5.3.5 An analogy

The package system frequently confuses new users. This stems from the fact that they aren’t clear about what the install.packages and library functions are doing. One way to think about these is by analogy with smartphone “Apps”. Think of an R package as analogous to a smartphone App— a package effectively extends what R can do, just as an App extends what a phone can do.

When we want to use a new App, we download it from the App store and install it on our phone. Once downloaded, the App lives permanently on the phone and can be used whenever it’s needed. Downloading and installing the App is something we only have to do once. Packages are no different. When we want to use an R package we first have to make sure it is installed on the computer (e.g. using install.packages). Installing a package is a ‘do once’ operation. Once installed, we don’t need to install a package again each time we restart R.

To actually use an App on our phone we open it up by tapping on its icon. This has to happen every time we want to use the App. The package equivalent of opening a smartphone App is the “load and attach” operation. This is what library does. It makes a package available for use in a particular session. We have to use library to load the package every time we start a new R session if we plan to access the functions in that package: loading and attaching a package via library is a “do every time” operation.

5.4 Package data

Remember what Hadley Wickam said about packages? “… include reusable R functions, the documentation that describes how to use them, and sample data.” Many packages include sample data sets for use in examples and package vignettes. Use the data function to list the data sets hiding away in packages:

data(package = .packages(all.available = TRUE))

The mysterious .packages(all.available = TRUE) part of this generates a character vector with the names of all the installed packages in it. If we only use data() then R only lists the data sets found in a package called datasets and any additional packages we have loaded in the current R session.

The datasets package is part of the base R distribution. It exists for one reason—to store example data sets. The datasets package is automatically loaded when we start R, i.e. there’s no need to use library to access it, meaning any data stored in this package can be accessed every time we start R.

From the perspective of learning to use R, working with package data is really useful because it allows us to work with ‘well-behaved’ data sets without having to worry about getting them into R. Importing data is certainly an important skill but it’s not necessarily something a new user wants to worry about. For this reason, we tend to use package data sets in this book.

5.5 The tidyverse ecosystem of packages

We’re going to be using several packages that belong to a widely used, well-known ecosystem of packages known as the tidyverse. Here is the description of the tidyverse on its website:

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Using the tidyverse makes data analysis simpler, faster and a more entertaining experience (honestly).