Chapter 12 Exploratory data analysis
12.1 Introduction
Exploratory data analysis (EDA) was promoted by the statistician John Tukey in his 1977 book, “Exploratory Data Analysis”. The broad goal of EDA is to help us formulate and refine hypotheses that lead to informative analyses or further data collection. The core objectives of EDA are:
- to suggest hypotheses about the causes of observed phenomena,
- to guide the selection of appropriate statistical tools and techniques,
- to assess the assumptions on which statistical analysis will be based,
- to provide a foundation for further data collection.
EDA involves a mix of numerical and visual methods of analysis. Statistical methods are sometimes used to supplement EDA. However, the main purpose of EDA is to facilitate understanding before carrying out formal statistical modelling. Even if we already know what kind of analysis we plan to pursue, it’s always a good idea to explore a data set before diving into that analysis. At the very least, this will help determine whether or not our plans are sensible. Very often, it uncovers new patterns and insights.
In this chapter, we’re going to examine some basic concepts that underpin EDA. We will:
- see how to classify different types of variables,
- distinguish between populations and samples, and
- review some key descriptive statistics.
This will provide a conceptual foundation and vocabulary for learning how to explore data in later chapters.
12.2 Statistical variables and data
In the Data frames chapter, we pointed out that the word ‘variable’ can mean one of two things. In programming, a variable is a name-value association that we create when we run some code. Statisticians use the word differently. To them, a variable is any characteristic or quantity that can be measured, classified or experimentally controlled. Much of statistics is about quantifying and explaining the variation in such quantities as best we can.
Species richness, relative abundance, infection status, enzyme activity, gene frequency, and blood glucose concentration are examples of statistical variables we encounter in the biological sciences. These are statistical variables because their values vary between different observations. For instance, ‘annual fitness’—measured as the number of offspring produced—is a variable that differs both among the organisms in a population and over the life cycle of a given individual.
There are different ways to describe statistical variables according to how they can be analysed, measured, or presented. It’s important to be clear about what kind of variables we’re dealing with because this determines how we should visualise the data, and later, how we might analyse it statistically. There are many different ways to go about classifying variables. However, we only need to be aware of two fairly simple classification schemes in this book: numeric vs categorical variables and ratio vs interval scales.
12.2.1 Numeric vs categorical variables
Numeric variables have values that describe a measurable quantity like ‘how many’ or ‘how much’ as a number. Numeric variables are also called quantitative variables; data recorded for numeric variables are called quantitative data. Numeric variables may be classified as either continuous or discrete:
Continuous numeric variable: Observations can take any value in a set of real numbers, i.e. numbers represented with decimals. Examples of continuous variables include concentration, mass, age, time, and temperature. The set of numbers a continuous variable takes is typically either ‘every possible number’ or ‘just positive numbers’. For example, a concentration may be very large or very small, but it is strictly positive, whereas a change in concentration can be positive or negative.
Discrete numeric variable: Observations can take a value based on a count from a set of whole values, e.g. 1, 2, 3, 4, 5, and so on. A discrete variable cannot take the value of a fraction between one value and the next closest value. Examples of discrete variables include the number of individuals in a population, the number of offspring produced (‘reproductive fitness’), and the number of infected individuals in an experiment. All of these are measured as whole units.
Categorical variables take values that describe a characteristic of a data unit, like ‘what type’ or ‘which category’. Categorical variables fall into mutually exclusive (in one category or in another) and exhaustive (include all possible options) categories. Categorical variables are qualitative variables and tend to be represented by a non-numeric value; the data collected for a categorical variable are called qualitative data. Categorical variables may be further described as ordinal or nominal:
Ordinal variable: Categories can be logically ordered or ranked. The categories associated with ordinal variables can be ranked higher or lower than another but do not necessarily establish a numeric difference between each category. Examples of ordinal categorical variables include academic grades (e.g. A, B, C) and size classes (e.g. small, medium, large).
Nominal variable: Categories cannot be organised into a logical sequence. Examples of nominal categorical variables include sex (see C. elegans example), human blood group (A, B, AB and O), genotype (e.g. AA, Aa, aa), experimental conditions (e.g. control vs enhanced nutrition), and mortality status (alive vs dead).
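In R, this distinction matters when we store the data: an ordinal variable is usually best represented as an ordered factor so that the ranking of its categories is preserved. Here is a minimal sketch using made-up size-class data:

# Hypothetical size-class observations recorded as text
size <- c("small", "large", "medium", "small", "medium")
# Store them as an ordered factor so R knows that small < medium < large
size_class <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
levels(size_class)

Printing size_class would also display the ordering as ‘Levels: small < medium < large’.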
Do not use numbers to classify categorical variables
Be careful when classifying variables. It can be dangerous to assume that a variable is not categorical just because it has been recorded using numbers. There is nothing to stop someone from using numbers to describe a categorical variable (e.g. C. elegans sex: Male = 1, Hermaphrodite = 2). That said, although we can use numbers to describe categories, it does not mean we should. Using numbers gets confusing and can lead to mistakes. It is much clearer to use a non-numeric recording scheme based on words or acronyms to record categorical variables (e.g. C. elegans sex: Male = “Male”, Hermaphrodite = “Herm”).
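For example, here is one way to convert a numeric coding of C. elegans sex into a clearly labelled factor. This is only a sketch using made-up data:

# Hypothetical data where sex was recorded as 1 = male, 2 = hermaphrodite
sex_code <- c(1, 2, 2, 1, 2)
# Replace the opaque numeric codes with self-explanatory labels
sex <- factor(sex_code, levels = c(1, 2), labels = c("Male", "Herm"))
sex
## [1] Male Herm Herm Male Herm
## Levels: Male Herm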
12.2.2 Ratio vs interval scales
A second way of classifying numeric variables (not categorical variables) relates to the scale they are measured on. The measurement scale is important because it determines how we interpret things like differences, ratios, and variability.
Ratio scale: This scale does possess a meaningful zero value. It gets its name from the fact that a measurement on this scale represents a ratio between a measure of the magnitude of a quantity and a unit of the same kind. What this means in simple terms is that it is meaningful to say that something is “twice as …” as something else when working with a variable measured on a ratio scale. Ratio scales most often appear when we work with physical quantities. For example, we can say that one tree is twice as big as another or that one elephant has twice the mass of another because length and mass are measured on ratio scales.
Interval scale: This allows for the degree of difference between measurements, but not the ratio between them. This kind of scale does not have a unique, non-arbitrary zero value. A good example of an interval scale is the date, which we measure relative to an arbitrary epoch (e.g. AD). It makes no sense to say that 2000 AD is twice as late as 1000 AD. However, we can compare ratios of differences on an interval scale. For example, it does make sense to talk about the amount of time between two dates, i.e. we can say that twice as much time had passed since the epoch in 2000 AD versus 1000 AD.
Keep in mind that the distinction between ratio and interval scales is a property of the measurement scale, not the thing being measured. For example, when we measure temperature in º C we’re working on an interval scale defined relative to the freezing and boiling temperatures of water under standard conditions. It doesn’t make any sense to say that 30º C is twice as hot as 15º C. However, if we measured the same two temperatures on the Kelvin scale, it is meaningful to say that 303.2K is 1.05 times hotter than 288.2K. This is because the Kelvin scale is relative to a true zero (absolute zero).
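A little arithmetic in R makes the point. This is just a sketch; 273.15 is the offset between the two scales:

# The same two temperatures on the Celsius (interval) and Kelvin (ratio) scales
celsius <- c(15, 30)
kelvin <- celsius + 273.15
celsius[2] / celsius[1]  # = 2, but 'twice as hot' is not meaningful on an interval scale
kelvin[2] / kelvin[1]    # about 1.05, a meaningful ratio because Kelvin has a true zero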
12.3 Populations and samples
Whenever we collect data, we are almost always working with a sample drawn from a wider population. We might want to know something about that population, but since it is impossible to study the whole population, we study the properties of one or more samples instead. For example, a physiologist might want to know how exercise affects lung function. Since they obviously can’t study every person on the planet, they have to study a small sample of people.
We mention the distinction between populations and samples because EDA is concerned with exploring the properties of samples. EDA aims to characterise a sample without trying to infer too much about the wider population from which it is derived. Learning about populations is the basis of much of statistics. That topic is best dealt with in a dedicated statistics book—not this one.
12.3.1 Sample distributions
When we say that ‘EDA is concerned with exploring the properties of samples,’ we actually mean that EDA is concerned with the properties of variables in one or more samples. We can be even more precise—the property we are alluding to is the variable’s sample distribution. The distribution of a variable describes the relative frequency with which different values occur. This is best understood by example…
Imagine we took a sample of undergraduates and measured their height. The majority of students would be about 1.7 m tall, though there would be plenty of variation among them. Men would tend to be slightly taller than women, and very short or very tall people would be rare. We know from experience that no one in this sample would be over 3 metres tall. These are all statements about a hypothetical sample distribution of undergraduate heights.
12.3.2 Associations
We’ve been talking about statistical variables as though we study them one at a time. However, most interesting samples involve more than one variable, and the goal of the ensuing data analysis is to understand associations among those variables. These associations might involve the same (e.g. numeric vs numeric) or different (e.g. numeric vs categorical) types of variable. Whatever the details, the aim of EDA is often to understand how each variable in a sample relates to the others.
12.4 Types of EDA
Exploratory data analysis involves questions such as:
- What are the most common values of the variable?
- How much do observations differ from one another?
- Is one variable associated with another?
Rather than describe the answers to such questions in purely verbal terms, as we did above, EDA relies on descriptive statistics and graphical summaries:
Descriptive statistics. Descriptive statistics are used to quantify the basic features of a sample distribution. They provide numerical summaries about the sample that can be used to make comparisons and draw preliminary conclusions. For example, we often use the mean to summarise the ‘most likely’ values of a variable.
Graphical summaries. Descriptive statistics are not much use on their own—a few numbers can’t capture every aspect of a distribution. Graphical summaries are a powerful complement to descriptive statistics because they capture a lot of information about a sample in a way that is easy for people to understand.
Descriptive statistics are important, but they are not the main focus of this book. The next few chapters will set out how to construct a range of useful graphical data summaries. Indeed, this book’s overarching goal is to demonstrate a workflow that starts with raw data and ends in one or more exploratory plots. We do need to know a bit about descriptive statistics to understand some of those plots. To that end, we’ll finish this chapter with a quick survey of descriptive statistics that will pop up later.
12.5 A primer of descriptive statistics
12.5.1 Numeric variables
So far, we’ve only mentioned the properties of sample distributions in very general terms—using phrases like ‘most common values’ and ‘the range of the data’—without really saying what we mean. Statisticians have devised a set of terms to describe distributions and various descriptive statistics to quantify these. The two that matter most for numeric variables are the central tendency and the dispersion:
A measure of central tendency describes a typical value of a distribution. Most people know at least one measure of central tendency. The “average” that they calculated at school is the arithmetic mean of a sample. There are many different measures of central tendency, each with its own pros and cons. Take a look at the Wikipedia page to see the most common ones. Among these, the median is the one that is used most often in exploratory analyses.
A measure of dispersion describes how spread out a distribution is. Dispersion measures quantify the variability or scatter of a variable. If one distribution is more dispersed than another, it means that it encompasses a wider range of values. What this means in practice depends on the kind of measure we’re working with. We tend to focus on the variance, the standard deviation, and the interquartile range. There are many others, though.
Beyond central tendency and dispersion
Another important aspect of a distribution is its skewness (a.k.a. ‘skew’). There are many different ways to quantify skewness. Unfortunately, these are quite difficult to make sense of. For now, we only need to understand what skewness means in qualitative terms. Skewness refers to the (a)symmetry of a distribution. When we talk about distributions with high skew, we mean they are very asymmetric.
12.5.1.1 Central tendency
The central tendency of a numeric variable’s sample distribution is typically described using either the arithmetic mean or the median. The arithmetic mean of a sample is ‘the mean’ that everyone learns at school. Most people have calculated the mean by hand at some point. As R users, we can use the mean function to calculate the arithmetic mean if we need it. For example, this will calculate the mean body mass in the penguins data set:
mean(penguins$body_mass_g, na.rm = TRUE)
## [1] 4201.754
Remember—we used na.rm = TRUE here because body_mass_g contains a small number of missing (NA) values. This calculation tells us the arithmetic mean of body mass is 4202 grams, i.e. in some sense, a typical body mass is about 4202 grams.
One limitation of the arithmetic mean is that it is affected by the shape of a distribution. This is why, for example, it does not make much sense to look at the mean income of workers to get a sense of what a ‘typical’ person earns. Income distributions are highly skewed, such that a few people receive very large salaries compared to the vast majority of the population. Those few who earn very good salaries tend to shift the mean well past anything that is really ‘typical’.
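A quick simulation illustrates the effect. This is only a sketch; log-normal values are used here as a stand-in for a right-skewed income distribution:

set.seed(42) # arbitrary seed so the example is reproducible
# Simulate 1000 right-skewed 'incomes' from a log-normal distribution
income <- rlnorm(1000, meanlog = 10, sdlog = 1)
mean(income)   # pulled upwards by a handful of very large values
median(income) # closer to what a 'typical' earner receives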
Because the sample mean is sensitive to the shape of a distribution in this way, we often prefer to use a more robust measure of central tendency—the sample median. The median of a sample is the value that separates the upper half from the lower half. We can find the sample median with the median function in R:
median(penguins$body_mass_g, na.rm = TRUE)
## [1] 4050
This tells us that the median body mass in the penguins data is 4050 grams. This is less than the mean, reflecting the fact that the body mass distribution is somewhat asymmetric.
What about ‘the mode’?
The mode of a variable’s distribution is simply the value that is most likely to occur. This is a simple idea. Unfortunately, it is often difficult to estimate the mode from a sample. Nonetheless, it is important to know what the mode represents, because the concept is useful even when the actual value is hard to estimate.
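For a discrete numeric variable we can at least get a rough estimate by tabulating the values and picking the most frequent one. A minimal sketch with made-up clutch-size data:

# Hypothetical clutch sizes (a discrete numeric variable)
clutch_size <- c(2, 3, 3, 3, 4, 4, 5)
# Tabulate the values and take the one with the highest count
as.numeric(names(which.max(table(clutch_size))))
## [1] 3

Estimating the mode of a continuous variable is harder because every observed value may be unique; it usually requires some kind of density estimation.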
12.5.1.2 Dispersion
There are many ways to quantify the dispersion of a sample distribution. The most important quantities from the standpoint of statistics are the sample variance and standard deviation. The sample variance (\(s^2\)) is the average of the squared deviations of each observation from the mean. Variances are non-negative. The larger the variance, the more observations are spread out around the mean. A variance of zero only occurs if all values are identical.
We won’t waste time showing the formula because we’ll never actually need to use it directly. As usual, R can calculate the sample variance if we need it:
var(penguins$body_mass_g, na.rm = TRUE)
## [1] 643131.1
That’s a big number. What does it mean? Is it ‘big’ or is it ‘small’? No idea. That’s the problem with variances—they are difficult to interpret because their calculation involves squared deviations. The variance is an important quantity in statistics because many common tools use changes in the variance as a basis for statistical tests. However, variances seldom feature in EDA because they are so hard to interpret.
A better descriptive statistic for summarising sample dispersion is a closely related quantity called the standard deviation of the sample, usually denoted \(s\). The standard deviation is equal to the square root of the variance. We calculate it using the sd function:
sd(penguins$body_mass_g, na.rm = TRUE)
## [1] 801.9545
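As a quick sanity check, taking the square root of the variance we calculated above reproduces the same value:

# The standard deviation is just the square root of the sample variance
sqrt(var(penguins$body_mass_g, na.rm = TRUE))
## [1] 801.9545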
Why do we prefer the standard deviation over the variance? Because it is the square root of the variance, it operates on the same scale as the variable it summarises. This means it reflects the dispersion we perceive in the data. The sample standard deviation is not without problems, though. Like the sample mean, it is sensitive to the shape of a distribution and the presence of outliers.
A measure of dispersion that is robust to these kinds of problems is the interquartile range. The interquartile range (IQR) is defined as the difference between the third and first quartiles (see box). The IQR contains the middle 50% of the values of a variable. The more spread out the data, the larger the IQR. People often prefer the IQR as a measure of dispersion in exploratory work because it only depends on the ‘middle’ of a distribution. This makes it robust to the presence of outliers.
We can use the IQR function to find the interquartile range of the body mass variable:
IQR(penguins$body_mass_g, na.rm = TRUE)
## [1] 1200
The IQR is used as the basis for a useful data summary plot called a ‘box and whiskers’ plot. We’ll see how to construct this later.
What are quartiles?
We need to know what a quartile is to understand the interquartile range. Three quartiles are defined for any sample. These divide the data into four equal-sized groups, from the set of smallest numbers up to the set of largest numbers. The second quartile (\(Q_2\)) is the median, i.e. it divides the data into an upper and lower half. The first quartile (\(Q_1\)) is the number that divides the lower 50% of values into two equal-sized groups. The third quartile (\(Q_3\)) is the number that divides the upper 50% of values into two equal-sized groups.
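We can also calculate the quartiles directly with the quantile function. This is a sketch using R’s default quantile method, which is also what IQR uses:

# The three quartiles of body mass: Q1, Q2 (the median) and Q3
quantile(penguins$body_mass_g, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
# Q3 minus Q1 reproduces the interquartile range of 1200 grams reported above
diff(quantile(penguins$body_mass_g, probs = c(0.25, 0.75), na.rm = TRUE))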
12.5.2 Categorical variables
Descriptive statistics of categorical variables aim to quantify specific features of their sample distribution, just as with numeric variables. The general question we need to address is, what are the relative frequencies of different categories? Because categorical variables take a finite number of values, the simplest thing we can do is tabulate the number of occurrences of each type. We can use the dplyr count function to do this:
penguins %>% count(species)
## # A tibble: 3 × 2
## species n
## <chr> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
This prints the number of observations associated with each species in penguins. The summary reveals that the most common species in the data set is the Adelie penguin, followed by the Gentoo and Chinstrap.
Can we quantify the central tendency of a categorical sample distribution? Various measures exist. We can certainly find the sample mode of a categorical variable easily enough. This is just the most common category. In the case of the above species variable, the mode is obviously Adelie. It is also possible to calculate a sample median of a categorical variable, but only when it is ordinal. Since the median value is the one that lies in the middle of an ordered set of values, it makes no sense to talk about the middle of a set of nominal values that have no inherent order.
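If we want R to find the mode of a categorical variable for us, one simple option is to tabulate the categories and pick out the most frequent one:

# Count the number of observations in each species category
species_counts <- table(penguins$species)
# The mode is the category with the largest count
names(species_counts)[which.max(species_counts)]
## [1] "Adelie"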
What about dispersion? Well, measures of dispersion for categorical variables do exist, but they are not very easy to interpret. They seldom get used in exploratory data analysis so let’s not worry about them here.
12.5.3 Associations
Statisticians have devised different ways to quantify an association between variables. The common measures calculate some kind of correlation coefficient. The terms ‘association’ and ‘correlation’ are closely related, so they are often used interchangeably. Strictly speaking, correlation has a narrower definition: a correlation is defined by a metric (the ‘correlation coefficient’) that quantifies the degree to which an association tends to a certain pattern.
12.5.3.1 Pairs of numeric variables
A widely used measure of correlation for pairs of numeric variables is Pearson’s correlation coefficient (a.k.a. Pearson product-moment correlation coefficient, or Pearson’s \(r\)). Remember, a correlation coefficient quantifies the degree to which an association tends to a certain pattern. Pearson’s correlation coefficient is designed to summarise the strength of a linear (i.e. straight line) association.
Pearson’s correlation coefficient takes a value of 0 if two variables are not linearly associated and a value of +1 or -1 if they are perfectly related by a straight line. A positive value indicates that high values of one variable are associated with high values of the second; a negative value indicates that high values of one variable are associated with low values of the second. High values are those that are greater than the mean; low values are those that are less than the mean.
We can use the cor function to calculate Pearson’s correlation coefficient. For example, the Pearson correlation coefficient between flipper length and body mass is given by:
cor(penguins$flipper_length_mm, penguins$body_mass_g, use = "complete.obs")
## [1] 0.8712018
This is positive, indicating flipper length tends to increase with body mass. It is also quite close to +1, indicating the association is strong. We should interpret Pearson’s correlation coefficient with care. Because it is designed to summarise the strength of a linear relationship, Pearson’s correlation coefficient will mislead when this relationship is curved. If that statement does not make immediate sense, take a look at the famous Anscombe’s quartet.
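Anscombe’s quartet is built into base R as the anscombe data frame, so we can check the numbers ourselves. All four x-y pairs have almost identical Pearson correlations (about 0.82), even though their scatterplots look completely different:

# Pearson correlations for the four x-y pairs in Anscombe's quartet;
# each is roughly 0.82 despite the very different shapes of the data
with(anscombe, c(cor(x1, y1), cor(x2, y2), cor(x3, y3), cor(x4, y4)))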
Other measures of correlation
What should we do if we think the relationship between two variables is non-linear? Calculate something called a rank correlation. The idea is quite simple. Instead of working with the actual values of each variable, we rank them by value, i.e. sort each variable from lowest to highest and then assign the labels 1st, 2nd, 3rd, … to the observations. Rank correlations are based on the association among the ranks of two variables.
The two most popular rank correlation coefficients are Spearman’s \(\rho\) (‘rho’) and Kendall’s \(\tau\) (‘tau’). The differences are minimal:
- Spearman’s \(\rho\) is a bit more sensitive to outliers in the data.
- Kendall’s \(\tau\) can be slow to calculate for large data sets.
We can calculate either coefficient using the cor function, setting the method argument to the appropriate value: method = "kendall" or method = "spearman". A rank correlation coefficient is interpreted in the same way as Pearson’s correlation coefficient. It takes a value of 0 if the ranks are uncorrelated and +/- 1 if they are perfectly associated (though not necessarily as a straight line). The sign tells us about the direction of the association.
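For example, here are both rank correlations for the same flipper length and body mass variables used above (a sketch; only the method argument changes):

# Spearman's rho between flipper length and body mass
cor(penguins$flipper_length_mm, penguins$body_mass_g,
    use = "complete.obs", method = "spearman")
# Kendall's tau for the same pair of variables
cor(penguins$flipper_length_mm, penguins$body_mass_g,
    use = "complete.obs", method = "kendall")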
12.5.3.2 Pairs of categorical variables
Quantifying associations between pairs of categorical variables is not as simple as the numeric case. The general question is, “do different combinations of categories seem to be under- or over-represented?” We need to understand which combinations are common and which are rare. The simplest thing we can do is ‘cross-tabulate’ the number of occurrences (i.e. the ‘frequencies’) of each combination. The resulting table is called a contingency table.
The xtabs function (= ‘cross-tabulation’) can do this. For example, the frequencies of each penguin species and island combination are given by:
xtabs(~ species + island, data = penguins)
## island
## species Biscoe Dream Torgersen
## Adelie 44 56 52
## Chinstrap 0 68 0
## Gentoo 124 0 0
The first argument sets the variables to cross-tabulate. xtabs uses R’s special formula language, which means we must include that ~ symbol at the beginning. After the ~, we provide the list of variables to cross-tabulate, separated by the + sign. The second argument tells the function which data set to use.
The table above shows us how many observations are associated with each combination of the species and island categories. This particular case represents a fairly extreme example of (dis)association; the Chinstrap and Gentoo species simply don’t occur on certain islands (or perhaps they weren’t sampled on those islands for some reason).
What about measures of association such as correlation coefficients? Spearman’s \(\rho\) and Kendall’s \(\tau\) are designed for numeric variables, but these can also be used to measure the correlation between ordinal variables (Kendall’s \(\tau\) is best). Various measures of association have been constructed for pairs of nominal variables (e.g. Cramér’s V). However, none of these are widely used in exploratory data analyses—people tend to stick with graphical tools for categorical data.
People often just say ‘the mean’ when referring to the arithmetic sample mean. This is fine, but keep in mind that there are other kinds of mean, such as the harmonic mean and the geometric mean.