Chapter 4 Data frames
4.1 Introduction
The quick introduction to R chapter introduced the word ‘variable’ as a short-hand for a named object. For example, we can make a variable called num_vec
that refers to a simple numeric vector using:
<- c(1.1, 2.3, 4.0, 5.7)
num_vec num_vec
## [1] 1.1 2.3 4.0 5.7
When a computer scientist talks about variables they’re referring to these sorts of name-value associations.
However, the word ‘variable’ has a second, more abstract meaning in the world of statistics: it refers to anything we can control or measure. For example, data from an experiment will involve variables whose values describe the experimental conditions (e.g. low temperature vs. high temperature) and the quantities we chose to measure (e.g. enzyme activity). We refer to these kinds of variables as ‘statistical variables’.
We’ll discuss these statistical variables later on. We’re pointing out the dual meaning of the word ‘variable’ now because we need to work with both interpretations. The duality can be confusing at times, but both meanings are in widespread use, so we just have to get used to them. We try to minimise confusion by using the phrase “statistical variable” when referring to data, rather than R objects.
We’re introducing these ideas now because we’re going to consider a new type of data object in this chapter—the data frame. Real-world data analysis involves collections of related statistical variables. How should we keep a large collection of variables organised? We could work with them individually, but this tends to be error prone. Instead, we need a way to keep related variables together. This is the problem that data frames are designed to manage.
4.2 Data frames
Data frames are one of R’s features that mark it out as particularly good for data analysis. We can think of a data frame as a table-like object with rows and columns. A data frame collects together different statistical variables, storing each of them as a separate column. Related observations are all found in the same row.
We’ll think about the columns first.
Each column of a data frame is a vector of some kind. These are usually simple atomic vectors containing numbers or character strings, though it is possible to include more complicated vectors. The critical constraint that a data frame applies is that each vector must have the same length. This is what gives a data frame it table-like structure.
The simplest way to get a feel for data frames is to make one. We can make one ‘by hand’ using some artificial data describing a made-up experiment. Imagine we had conducted a small experiment to examine biomass and community diversity in six field plots. Three plots were subjected to fertiliser enrichment. The other three plots act as experimental controls.
We could store the data describing this experiment in three vectors:
trt
(short for “treatment”) shows which experimental manipulation was used in a given plot.bms
(short for “biomass”) shows the total biomass measured at the end of the experiment.div
(short for “diversity”) shows the number of species present at the end of the experiment.
Here’s some R code to generate these three vectors (it doesn’t matter what the actual values are, they’re made up):
<- rep(c("Control","Fertilser"), each = 3)
trt <- c(284, 328, 291, 956, 954, 685)
bms <- c(8, 12, 11, 8, 4, 5) div
trt
## [1] "Control" "Control" "Control" "Fertilser" "Fertilser" "Fertilser"
bms
## [1] 284 328 291 956 954 685
div
## [1] 8 12 11 8 4 5
Notice that the information about different observations are linked by their positions in these vectors. For example, the third control plot had a biomass of ‘291’ and a species diversity ‘11’.
We use the data.frame
function to construct a data frame from one or more vectors, i.e. to build a data frame from the three vectors we just created:
<- data.frame(trt, bms, div)
experim.data experim.data
## trt bms div
## 1 Control 284 8
## 2 Control 328 12
## 3 Control 291 11
## 4 Fertilser 956 8
## 5 Fertilser 954 4
## 6 Fertilser 685 5
Notice what happens when we print the data frame: it is displayed as though it has rows and columns. That’s what we meant when we said a data frame is a table-like structure.
The data.frame
function takes a variable number of arguments. We used the trt
, bms
and div
vectors, resulting in a data frame with three columns. Each of these vectors has 6 elements, so the resulting data frame has 6 rows. The names of the vectors were used to name its columns. The rows do not have names, but they are numbered to reflect their position.
The words trt
, bms
and div
are not very informative. If we prefer to work with more meaningful column names—which is always a good idea—then we can name the data.frame
arguments:
<- data.frame(Treatment = trt, Biomass = bms, Diversity = div)
experim.data experim.data
## Treatment Biomass Diversity
## 1 Control 284 8
## 2 Control 328 12
## 3 Control 291 11
## 4 Fertilser 956 8
## 5 Fertilser 954 4
## 6 Fertilser 685 5
The new data frame contains the same data as the previous one but now the column names correspond to the human-readable words we chose.
Don’t bother with row names
We can also name the rows of a data frame using the row.names
argument of the data.frame
function. We won’t bother to show an example of this, though. Why? We can’t easily work with the information in row names which means there’s not much point adding it. If we need to include row-specific information in a data frame it’s best to include an additional variable, i.e. an extra column.
4.3 Exploring data frames
The first things to do when presented with a new data set is to explore its structure to understand what we’re dealing with. There are plenty of options for doing this when the data are stored in a data frame. For example, the head
and tail
functions extract the first and last few rows of a data set:
head(experim.data, n = 3)
## Treatment Biomass Diversity
## 1 Control 284 8
## 2 Control 328 12
## 3 Control 291 11
tail(experim.data, n = 3)
## Treatment Biomass Diversity
## 4 Fertilser 956 8
## 5 Fertilser 954 4
## 6 Fertilser 685 5
Notice that the n
argument controls the number of rows printed. The View
function can be used to open up the whole data set in a table- or spreadsheet-like like view:
View(experim.data)
Exactly what happens when we use View
depends on how we’re interacting with R. When we run it in RStudio a new tab opens up with the data shown inside it.
View
only displays the data
The View
function is only designed to display a data frame as a table of rows and columns. We can’t change the data in any way with the View
function. We can reorder the way the data are presented, but keep in mind that this won’t alter the underlying data.
There are quite a few different R functions that will extract information about a data frame. The nrow
and ncol
functions return the number of rows and columns, respectively:
nrow(experim.data)
## [1] 6
ncol(experim.data)
## [1] 3
The names
function can be used to extract the column names from a data frame:
colnames(experim.data)
## [1] "Treatment" "Biomass" "Diversity"
The experim.data
data frame has three columns, so names
returns a character vector of length three, where each element corresponds to a column name.
4.4 Extracting and adding a single variable
Remember, each column of a data frame can be thought of as a variable. Data frames would not be much use if we could not extract and modify the variables they contain. In this section, we will briefly review how to extract a variable. We’ll examine ways to manipulate the data within a data frame in later chapters.
One way of extracting a variable from a data frame uses a double square brackets construct, [[
. For example, we extract the Biomass
variable from our example data frame with the double square brackets like this:
"Biomass"]] experim.data[[
## [1] 284 328 291 956 954 685
This prints whatever is in the Biomass
column to the Console. What kind of object is this? It’s a numeric vector:
is.numeric(experim.data[["Biomass"]])
## [1] TRUE
See? A data frame is a collection of vectors. Notice that all we did was print the resulting vector to the Console. If we want to do something with this numeric vector we need to assign the result:
<- experim.data$Biomass
bmass ^2 bmass
## [1] 80656 107584 84681 913936 910116 469225
Here, we extracted the Biomass
variable, assigned it to bmass
, and then squared this.
Notice that we used "Biomass"
instead of Biomass
inside the double square brackets, i.e. we quoted the name of the variable. This is because we want R to treat the word “Biomass” as a literal value. This little detail is important! If we don’t quote the name, R will assume that Biomass
is the name of an object and go in search of it in the global environment. Since we haven’t created something called Biomass
, leaving out the quotes would generate an error:
experim.data[[Biomass]]
## Error in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, : object 'Biomass' not found
The error message is telling us that R can’t find a variable called Biomass
.
The second method for extracting a variable uses the $
operator. For example, to extract the Biomass
column from experim.data
, we use:
$Biomass experim.data
## [1] 284 328 291 956 954 685
We use the $
operator by placing the name of the data frame we want to work with on the left-hand side and the name of the column (i.e. the variable) we want to extract on the right-hand side. Notice that we didn’t have to put quotes around the variable name when using the $
operator. We can do this if we want to—i.e. experim.data$"Biomass"
also works—but $
doesn’t require it.
Why is there more than one way to extract variables?
There’s no simple way to answer this question without getting into the details of how R represents data frames. The simple answer is that $
and [[
are not strictly equivalent, even though they appear to do much the same thing. The $
method is a bit easier to read, and people tend to prefer it for interactive data analysis tasks, whereas the [[
construct tends to be used when we need a bit more flexibility for programming.