Chapter 9 One sample t-tests
9.1 When do we use a one-sample t-test?
The one-sample t-test is a very simple statistical test. It is used when we have a sample of a numeric variable and we want to compare its population mean to a particular value. The one-sample t-test evaluates whether the population mean is likely to be different from this value. The expected value could be any value we are interested in. Here are a couple of specific examples:
We have developed a theoretical model of foraging behaviour that predicts an animal should leave a food patch after 10 minutes. Suppose we have data on the actual time spent by 25 animals observed foraging in the patch. In that case, we could test whether the mean foraging time is ‘significantly different’ from the prediction using a one-sample t-test.
We are monitoring sea pollution and have collected a series of water samples from a beach. We wish to test whether the mean density of faecal coliforms—bacteria indicative of sewage discharge—can be regarded as greater than the legislated limit. A one-sample t-test will test whether the mean value for the beach as a whole exceeds this limit.
Let’s see if we can get a sense of how these tests would work.
9.2 How does the one-sample t-test work?
Imagine we have taken a sample of some variable, and we want to evaluate whether its mean is different from some number (the ‘expected value’).
Here’s an example of what these data might look like if we had used a sample size of 50:
We’re calling the variable ‘X’ in this example. It needs a label, and ‘X’ is as good as any other. The red line shows the sample mean. This is a bit less than 11. The blue line shows the expected value. This is 10, so this example could correspond to the foraging study mentioned above.
The observed sample mean is about one unit larger than the expected value. How do we decide whether the population mean is really different from the expected value? Perhaps the difference between the observed and expected value is due to sampling variation. Here’s how a frequentist tackles this kind of question:
Set up an appropriate null hypothesis, i.e. an hypothesis of ‘no effect’ or ‘no difference’. The null hypothesis in this type of question is that the population mean is equal to the expected value.
Work out what the sampling distribution of a sample mean looks like under the null hypothesis. This is the null distribution. Because we’re now using a parametric approach, we will assume this has a particular form.
Finally, we use the null distribution to assess how likely the observed result is under the null hypothesis. This is the p-value calculation that we use to summarise our test.
This chain of reasoning is no different from that developed in the bootstrapping example from earlier. We’re just going to make an extra assumption this time to allow us to use a one-sample t-test. This extra assumption is that the variable (‘X’) is normally distributed in the population. When we make this normality assumption, the whole process of carrying out the statistical test is actually very simple because the null distribution will have a known mathematical form—it ends up being a t-distribution.
We can use this knowledge to construct the test of statistical significance. Instead of using the whole sample, as we did with the bootstrap, now we only need three simple pieces of information to construct the test: the sample size, the sample variance, and the sample mean. The one-sample t-test is then carried out as follows:
Step 1. Calculate the sample mean. This happens to be our ‘best guess’ of the unknown population mean. However, its role in the one-sample t-test is to allow us to construct a test statistic in the next step.
Step 2. Estimate the standard error of the sample mean. This gives us an idea of how much sampling variation we expect to observe. The standard error doesn’t depend on the true value of the mean, so the standard error of the sample mean is also the standard error of any mean under any particular null hypothesis.
This second step boils down to applying a simple formula involving the sample size and the standard deviation of the sample:
\[\text{Standard Error of the Mean} = \sqrt{\frac{s^2}{n}}\]
…where \(s^2\) is the square of the standard deviation (the sample variance) and \(n\) is for the sample size. This is the formula introduced in the parametric statistics chapter. The standard error of the mean gets smaller as the sample size grows or the sample variance shrinks.
Step 3. Calculate a ‘test statistic’ from the sample mean and standard error. We calculate this by dividing the sample mean (step 1) by its estimated standard error (step 2):
\[\text{t} = \frac{\text{Sample Mean}}{\text{Standard Error of the Mean}}\]
If our normality assumption is reasonable, this test-statistic follows a t-distribution. This is guaranteed by the normality assumption. So this particular test statistic is also a t-statistic. That’s why we label it t. This knowledge leads to the final step…
Step 4. Compare the t-statistic to the theoretical predictions of the t-distribution to assess the statistical significance of the difference between observed and expected value. We calculate the probability that we would have observed a difference with a magnitude as large as, or larger than, the observed difference, if the null hypothesis were true. That’s the p-value for the test.
We could step through the actual calculations involved in these steps in detail, using R to help us, but there’s no need to do this. We can let R handle everything for us. But first, we should review the assumptions of the one-sample t-test.
9.2.1 Assumptions of the one-sample t-test
Several assumptions need to be met in order for a one-sample t-test to be valid. Some of these are more important than others. We’ll start with the most important and work down the list in reverse order of importance:
- Independence. In rough terms, independence means each observation in the sample does not ‘depend on’ the others. We’ll discuss this more carefully when we consider the principles of experimental design. The key thing to know now is why this assumption matters: if the data are not independent, the p-values generated by the one-sample t-test will be unreliable.
(In fact, the p-values will be too small when the non-independence assumption is broken. That means we risk the false conclusion that a difference is statistically significant when in reality, it is not)
Measurement scale. The variable being analysed should be measured on an interval or ratio scale, i.e. it should be a numeric variable of some kind. Applying a one-sample t-test to a variable that is not measured on one of these scales doesn’t make much sense.
Normality. The one-sample t-test will only produce completely reliable p-values when the variable is normally distributed in the population. However, this assumption is less important than many people think. The t-test is robust to mild departures from normality when the sample size is small, and when the sample size is large, the normality assumption hardly matters at all.
We don’t have the time to explain why the normality assumption is not too important for large samples, but we can at least state the reason: it is a consequence of that central limit theorem we mentioned in the last chapter.
9.2.2 Evaluating the assumptions
The first two assumptions—independence and measurement scale—are aspects of experimental design. We can only evaluate these by thinking carefully about how the data were gathered and what was measured. It’s too late to do anything about these after we have collected our data.
What about that 3rd assumption—normality? One way to evaluate this is by visualising the sample distribution. For small samples, if the sample distribution looks approximately normal then it’s probably fine to use the t-test. For large samples, we don’t even need to worry about a moderate departure from normality3.
9.3 Carrying out a one-sample t-test in R
As usual, a good way to understand what follows is by working through the example. Remember, the code again assumes that ‘MORPH_DATA.CSV’ has been read into a tibble called morph_data
.
We’ll use the plant morph example again to learn how to carry out a one-sample t-test in R. Remember, the data were ‘collected’ to 1) compare the frequency of purple morphs to a prediction and 2) compare the mean dry weight of purple and green morphs. Neither of these questions can be tackled with a one-sample t-test.
Instead, let’s pretend that we unearthed a report from 30 years ago that found the mean size of purple morphs to be 710 grams. We want to evaluate whether the mean size of purple plants in the contemporary population is different from this expectation, because we think they may have adapted to local conditions.
We only need the purple morph data for this example, so we should first filter
the data to get hold of only the purple plants:
# get just the purple morphs
<- filter(morph_data, Colour == "Purple") purple_morphs
The purple_morphs
data set has two columns: Weight
contains the dry weight biomass of purple plants, and Colour
indicates which sample (plant morph) an observation belongs to. We don’t need the Colour
column anymore because we’ve just removed all the green plants, but there’s no harm in leaving it in.
9.3.1 Visualising the data and checking the assumptions
We start by calculating a few summary statistics and visualising the sample distribution of purple morph dry weights. That’s right… always check the data before carrying out a statistical analysis! We already looked over these data in the Statistical comparisons chapter. However, we’ll proceed as though this is the first time we’ve seen them to demonstrate a complete workflow.
Here is the dplyr code to produce the descriptive statistics:
%>%
purple_morphs summarise(mean = mean(Weight),
sd = sd(Weight),
samp_size = n())
## # A tibble: 1 × 3
## mean sd samp_size
## <dbl> <dbl> <int>
## 1 767. 156. 77
We have 77 purple plants in the sample. Not bad but we should keep an eye on the normality assumption. Let’s check this by making a histogram with the sample of purple plant dry weights:
ggplot(purple_morphs, aes(x = Weight)) +
geom_histogram(binwidth = 50)
These is nothing too ‘non-normal’ about this sample distribution—it’s roughly bell-shaped—so it seems reasonable to assume it came from normally distributed population.
9.3.2 Carrying out the test
It’s easy to carry out a one-sample t-test in R. We use a function called t.test
(no surprises there). Remember, Weight
contains the dry weight biomass of purple plants. Here’s the R code to carry out a one-sample t-test:
t.test(purple_morphs$Weight, mu = 710)
We have suppressed the output because we initially want to focus on how to use the t.test
function. We have to assign two arguments to control what the function does:
The first argument (
purple_morphs$Weight
) is simply a numeric vector containing the sample values. Sadly we can’t givet.test
a data frame when doing a one-sample test. Instead, we have to pull out the column we’re interested in using the$
operator.The second argument (called
mu
) sets the expected value to which we want to compare the sample mean. Somu = 710
tells the function to compare the sample mean to a value of 710. This can be any value we like, and the correct value depends on the question we’re asking.
That’s it for setting up the test. Here is the R code for the test again, this time with the output included:
t.test(purple_morphs$Weight, mu = 710)
##
## One Sample t-test
##
## data: purple_morphs$Weight
## t = 3.1811, df = 76, p-value = 0.002125
## alternative hypothesis: true mean is not equal to 710
## 95 percent confidence interval:
## 731.1490 801.9787
## sample estimates:
## mean of x
## 766.5638
The first line tells us what kind of t-test we used. This says: One Sample t-test
. OK, now we know that we used the one-sample t-test (there are other kinds). The next line reminds us about the data. This says: data: purple_morphs$Weight
, which is R-speak for ’we compared the mean of the Weight
variable to an expected value. Which value? This is given later.
The third line of text is the most important. This says: t = 3.1811, df = 76, p-value = 0.002125
. The first part of this, t = 3.1811
, is the test statistic, i.e. the value of the t-statistic. The second part, df = 76
, summarise the ‘degrees of freedom’. This is required to work out the p-value. It also tells us something about how much ‘power’ our statistical test has (see the box below). The third part, p-value = 0.002125
, is the all-important p-value.
That p-value indicates a statistically significant difference between the mean dry weight biomass and the expected value of 710 g (p is less than 0.05). Because the p-value is less than 0.01 but greater than 0.001, we report this as ‘p < 0.01’. Read through the Presenting p-values section again if this logic is confusing.
The fourth line of text (alternative hypothesis: true mean is not equal to 710
) tells us the alternative to the null hypothesis (H1). More importantly, this reminds us which expected value was used to formulate the null hypothesis (mean = 710).
The next two lines show us the ‘95% confidence interval’ for the difference between the means. We don’t need this information now, but we can think of this interval as a rough summary of the likely values of the ‘true’ mean4.
The last few lines summarise the sample mean. This might be useful if we had not already calculated.
A bit more about degrees of freedom
Degrees of freedom (abbreviated d.f. or df) are closely related to the idea of sample size. The greater the degrees of freedom associated with a test, the more likely it is to detect an effect if it’s present. To calculate the degrees of freedom, we start with the sample size, and then reduce this number by one for every quantity (e.g. a mean) we had to calculate to construct the test.
Calculating degrees of freedom for a one-sample t-test is easy. The degrees of freedom are just n-1, where n is the number of observations in the sample. We lose one degree of freedom because we have to calculate one sample mean to construct the test.
9.3.3 Summarising the result
Having obtained the result, we now need to write our conclusion. We are testing a scientific hypothesis, so we must always return to the original question to write the conclusion. In this case, the appropriate conclusion is:
The mean dry weight biomass of purple plants is significantly different from the expectation of 710 grams (t = 3.18, d.f. = 76, p < 0.01).
This is a concise and unambiguous statement in response to our initial question. The statement indicates the result of the statistical test and which value was used in the comparison. It is sometimes appropriate to give the values of the sample mean in the conclusion:
The mean dry weight biomass of purple plants (767 grams) is significantly different from the expectation of 710 grams (t = 3.18, d.f. = 76, p < 0.01).
Notice that we include details of the test in the conclusion. However, keep in mind that when writing scientific reports, the reporting of any statistical test should be a conclusion like the one above. Simply writing t = 3.18 or p < 0.01 is not an adequate.
There are a number of common questions that arise when presenting t-test results:
What do I do if t is negative? Don’t worry. A t-statistic can come out negative or positive, it simply depends on which order the two samples are entered into the analysis. Since the absolute value of t determines the p-value, when presenting the results, just ignore the minus sign and give t as a positive number.
How many significant figures for t? The t-statistic is conventionally given to 3 significant figures. This is because, in terms of the p-value generated, there is almost no difference between, say, t = 3.1811 and t = 3.18.
Upper or lower case The t statistic should always be written as lower case when writing it in a report (as in the conclusions above). Similarly, d.f. and p are always best as lower case. Some statistics we encounter later are written in upper case, but even with these, d.f. and p should be lower case.
How should I present p? There are various conventions in use for presenting p-values. We discussed these in the Hypotheses and p-values chapter. Learn them! It’s not possible to understand scientific papers or prepare reports properly without knowing these conventions.
p = 0.00? It’s impossible! p = 1e-16? What’s that?
Some statistics packages will sometimes give a probability of p = 0.00. This does not mean the probability was really zero. A probability of zero would mean an outcome was impossible, even though it happened! When a computer package reports p = 0.00 it just means that the probability was ‘very small’ and ended up being rounded down to 0.
R typically uses a slightly different convention for presenting small probabilities. A very small probability is given as p-value < 2.2e-16
. What does 2.2e-16 mean? This is R-speak for scientific notation, i.e. \(2.2e^{-16}\) is equivalent to \(2.2 \times 10^{-16}\).
It is fine to report a result like this as p < 0.001; definitely do not write something like p = 2.2e-16.
It’s hard to say what constitutes a ‘large’ sample. If we’re lucky enough to be working with hundreds of observations it’s fine to ignore mild departures from normality. If we’re working with 10s of observations we probably need to pay a bit more attention to the normality assumption in a t-test.↩︎
In reality, a frequentist confidence interval is more complicated than that but this book probably isn’t the right place to get into that discussion. Take our word for it though—confidence intervals are odd things.↩︎