Chapter 17 Principles of experimental design

17.1 Introduction

The data we use to test hypotheses may be generated by recording information from natural systems (‘observational studies’) or by carrying out some sort of experiment in which the system under study is manipulated in some way (‘experimental studies’). When conducting experiments, there is often considerable scope for deliberately arranging the system to generate data in the way best suited to testing a particular effect. For this reason, we tend to use the term ‘design’ primarily in the context of experiments. However, collecting data in both situations requires thought and planning, and many considerations of what is termed experimental design apply equally to observational and experimental studies12.

The underlying principle of experimental design is to extract data from a system in such a way that variation in the data can be unambiguously attributed to the particular process we are investigating.

To do this, we need to know how to maximise the statistical power of an experiment or data collection protocol. Statistical power is the probability that a study will detect an effect when there really is an effect present. In statistics, the word ‘effect’ is an umbrella term for anything measurable we care about, such as differences between groups or associations between variables. Statistical power is influenced by: (1) the size of the effect and (2) the size of the sample used to detect it. Bigger effects are easier to detect than smaller ones, and large samples give a test greater sensitivity than small ones. A further consideration is that the less variable the data, the smaller the effects we can detect.

Given these facts, there are obviously two things to do when designing an experiment:

  1. Use the maximum feasible sample sizes.
  2. Take steps to minimise the variability in the data13.

Exactly what combination of these is appropriate will depend on the subject area. In a physiological experiment using complex apparatus and monitoring equipment, the scope for replication may be very limited. Here, maximum effort should be put into experimentally controlling extraneous sources of variation. For the subject material, this may mean using animals of the same age, stock, and rearing conditions; it may involve using clones of plant material. It will involve running the experiment under controlled conditions of light and temperature and using measurement methods that are as precise as possible. On the other hand, an ecologist studying an organism in the field may have relatively little scope for experimental control of either the material studied or the environmental conditions, and may be forced to make relatively crude measurements. In this case, the best approach is to control what can be controlled and then try to maximise the sample size.
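
To get a feel for how these considerations trade off, the sketch below uses R’s built-in power.t.test() function to compare the power of a two-sample t-test under different sample sizes and levels of variability. The effect size, standard deviations and sample sizes here are invented purely for illustration.

    # Power to detect a difference of 2 units between two group means.
    # All numbers are invented for illustration.

    # Small sample, noisy data: low power
    power.t.test(n = 10, delta = 2, sd = 3)$power

    # Larger sample, same effect and noise: power increases
    power.t.test(n = 30, delta = 2, sd = 3)$power

    # Small sample again, but less variable data: power also increases
    power.t.test(n = 10, delta = 2, sd = 1.5)$power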

17.2 Jargon busting

Before we delve further into experimental design concepts, we need to introduce a little bit of statistical jargon. We’ll define the terms and then run through an example to better understand them:

  • An experimental unit is the physical entity assigned to a treatment (see next definition). Examples of possible experimental units are individual clones or organisms.

  • A treatment is any kind of manipulation applied to experimental units. A group of experimental units that all receive the same treatment is called a treatment group.

  • Most experiments include one or more complementary groups, called control groups. The experimental units in a control group receive either no treatment or some kind of standard treatment.

  • An experimental factor is a collection of related treatments and controls, and the different treatments/controls are called the levels of that factor.

Here’s an example. Suppose we want to compare cattle weight gain on four different dietary supplements to determine which is the most effective. We conduct an experiment in which groups of eight cows are each given a particular supplement for one month. A fifth group serves as the control group—they do not receive any supplement. At the end of the experiment, we measure how much weight each cow has gained over the month. In this example, individual cows are the experimental units, the dietary supplements are the treatments, and the ‘no supplement’ group is the control group. Together, the four supplement types and the ‘no supplement’ control constitute the five levels of the ‘dietary supplement’ factor.
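
To make the jargon concrete, here is a minimal sketch of how this layout could be represented in R. The group labels and the data frame are hypothetical; the point is simply that the control and the four supplement groups are the five levels of a single factor.

    # Hypothetical layout: five groups (control + four supplements) of eight cows
    supplement <- factor(rep(c("none", "supplement 1", "supplement 2",
                               "supplement 3", "supplement 4"), each = 8))
    cattle <- data.frame(cow = 1:40, supplement = supplement)

    levels(cattle$supplement)  # the five levels of the 'dietary supplement' factor
    table(cattle$supplement)   # eight experimental units in each group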

Finally, a word of warning—it is common to lump control groups and treatment groups together and just call them ‘treatments’. This is fine, but be aware of the distinction between the two.

17.3 Replication

We cannot do statistics without understanding the idea of replication—the process of assigning several experimental units to the same treatment or combination of treatments. Why does replication matter? Replication affects the power of a statistical test—by increasing the replication in a study, we increase the sample size available to detect specific effects. Replication is particularly important in biology because the material we work with is often inherently variable and hard to make precise measurements on.

The basic idea is simple: increased replication = more statistical power. However, we have to be very careful about how we replicate…

17.3.1 Independence and pseudoreplication

Most statistical tests assume that the data are independent. Independence means that the value of a measurement from one object is not affected by the values of other objects. Common sources of non-independence in biology include:

  • genetics - e.g. if a set of mice are taken from the litter of a single female, they are more likely to be similar to each other than mice taken from the litters of several different females.
  • geography - e.g. samples from sites close together will experience similar microclimate, have similar soil type etc.
  • sampling within biological ‘units’ - e.g. leaves on a tree will be more similar to each other than to leaves from other trees.
  • experimental arrangements in the lab - e.g. plants grown together in a pot, or fish kept in one aquarium, will all be affected by the conditions in that pot/aquarium.

Non-independence occurs at many levels in biological data, and in statistical testing, the common consequence of non-independence is pseudoreplication. Pseudoreplication is an artificial increase in the sample size caused by using non-independent data. It may be easiest to see what this means by example.

Imagine we are interested in whether plants of a particular species produce flowers with different numbers of petals when grown in two different soil types. We have three plants in each soil type, and each plant produces four flowers. If we count the petals on a single flower from each plant and then test the difference using a t-test, we get the following result:

Soil type      Num. petals (one flower per plant)     Mean
               Plant 1      Plant 2      Plant 3
Soil type A    3            4            5             4
Soil type B    4            5            6             5

p = 0.29

The difference is not significant. Now imagine that instead of sampling a single flower from each plant, we count the petals of all four flowers on each plant and (incorrectly) use all the values in the analysis (giving an apparent sample size of 12 in each treatment):

Soil type      Num. petals (four flowers per plant)                 Mean
               Plant 1         Plant 2         Plant 3
Soil type A    3, 2, 3, 4      4, 4, 3, 5      3, 6, 7, 4            4
Soil type B    4, 5, 4, 3      5, 7, 3, 5      6, 5, 7, 6            5

p = 0.009

The same difference in the means now appears to be highly significant! The problem here is that the flowers within each plant are not independent: there is variation among plants in petal numbers, but within a plant (perhaps for genetic reasons) the number of petals produced is similar. Because of this non-independence, the apparent significance in the final result is spurious. There are only three independent entities in each soil type treatment—the plants—so the first of the two tests here is correct. The second is pseudoreplicated.
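
As a sketch of how the correct, plant-level analysis might be run in R, we can average the four flowers within each plant so that the plants, not the flowers, are the replicates (the petal counts are those from the tables above):

    # Petal counts for every flower, labelled by plant and soil type
    petals <- data.frame(
      soil  = rep(c("A", "B"), each = 12),
      plant = rep(c("A1", "A2", "A3", "B1", "B2", "B3"), each = 4),
      count = c(3, 2, 3, 4,   4, 4, 3, 5,   3, 6, 7, 4,
                4, 5, 4, 3,   5, 7, 3, 5,   6, 5, 7, 6)
    )

    # Pseudoreplicated (wrong): every flower treated as a replicate, n = 12 per soil
    # t.test(count ~ soil, data = petals)

    # Correct: average within plants first, so each plant is one replicate, n = 3 per soil
    plant_means <- aggregate(count ~ plant + soil, data = petals, FUN = mean)
    t.test(count ~ soil, data = plant_means)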

To illustrate the effect in a still more obvious way, consider if we were interested in the heights of plants in the two soil types but we only had one plant in Soil A and one in Soil B. If we measure the plants and find they differ somewhat in height, we cannot tell whether this is due to the soil, or just because no two plants are identical. With one plant in each soil, we cannot carry out a statistical test to compare the heights. Now, if it were suggested that we measure the height of each plant 20 times and then use those numbers to do a statistical test to compare the plant heights in the two soils, we would realise that this was an entirely pointless exercise.

There is no more information about the effect of soil type in the two sets of 20 measurements than in the single measurement (except we now know how variable our measuring technique is). And why stop at 20? Why not just keep remeasuring until we have enough numbers to get a significant difference?!

The pitfall of pseudoreplication may seem obvious. However, it can creep into biological studies in quite subtle ways and occurs in a significant number of published studies. One very common problem occurs in ecological studies where different habitats, or experimental plots, are being compared. Say we are looking at zooplankton abundance in two lakes, one with fish and one without. We would normally take a number of samples from each lake and could obviously compare the zooplankton numbers between these two sets of samples. It would be tempting to attribute any differences we observe to the effect of fish. However, this would not be correct.

We have measured the difference in zooplankton between the two lakes (and this is quite a valid thing to do), but the lakes may differ in any number of ways, not just the presence of fish, so it is not correct to interpret our result in relation to the effect of fish. To do this, we would need data on zooplankton abundance in several lakes with fish and several without. In other words, for testing the effect of fish, our replicates should be whole lakes with and without the relevant factor (fish), not samples from within a single lake.

Surely it is still better to take lots of samples from each site than just one; it must give a more accurate picture? This is true. Taking several measurements or samples from each object guards against the possibility of the results being influenced by a single, possibly unusual, sample. It would be much more reliable to have twenty zooplankton samples from a lake than just one. This is important, but it is not the same as having measurements from more objects (lakes)—true replication—which increases the power of the statistical test to detect differences among experimental units with respect to the particular factor (e.g. fish / no fish) we are interested in.

In summary, when carrying out an investigation the key question to ask is: What is the biological unit of replication relevant to the effect we are trying to test? As this implies, the appropriate unit of replication may vary depending on what we are investigating. If we want to test for a difference in the plankton density between two lakes, then taking 10 samples from each lake and comparing them would be the correct approach. But if, as above, we wanted to assess the effect of fish on plankton density, this would be inappropriate—the correct unit of replication in this case is the whole lake, and we would therefore want to sample several lakes with and without fish.

17.4 Controls

We are told repeatedly, probably starting at primary school, that every experiment must have a control—a reference treatment against which the other treatments can be compared. The idea does sometimes generate confusion since some experiments do not require a control, whereas others may require more than one control. What’s more, it can be difficult to pin down what to control for.

In some cases, the appropriate control is obvious. In a toxicity test, we are interested in the mortality due to the toxicant, so we want the control to tell us the background mortality rate in the absence of the toxicant. However, if we measure the movement rates of slugs on surfaces of differing moisture content, no control is required—indeed, none is possible. Slugs encounter many different moisture conditions in their daily lives, and there is no ‘control’ moisture level.

More tricky is the situation where the objects we are investigating are affected not just by the treatment we are administering but also by other effects of applying that treatment. The use of control treatments can sometimes address this, but such controls are no longer simply the ‘natural’ situation. They may have to be specifically designed to mimic certain aspects of the experiment but not others. These sorts of controls are discussed in more detail below.

17.5 Confounded and noisy experiments

Unwanted variation comes in two forms:

  • The first is confounding variation. This occurs when one or more other sources of variation work in parallel to the factor we are investigating, making it hard, or impossible, to unambiguously attribute any effects we see to a single cause. Confounding variation is particularly problematic in observational studies because, by definition, we don’t manipulate the factors we’re interested in.
  • The second is noise. This describes variation that is unrelated to the factor we are investigating. Noise adds variability to the results, making any effect of the factor harder to see and to detect statistically. As noted above, much of experimental design is about improving our ability to account for noise in a statistical analysis.

We will consider these together because some of the techniques for dealing with them apply to both.

17.5.1 Confounding

The potential for confounding effects may sometimes be easy to recognise. Suppose we measure growth rates in plants growing at sites of differing altitudes. In that case, several factors all change systematically with altitude (temperature, ultraviolet radiation, precipitation, wind speed, etc.), and it may be hard to use such data to examine the effects of any one of these factors alone. The important thing to remember is that observing a relationship between two variables does not necessarily indicate a causal link. A negative relationship between plant growth and increased precipitation up a mountain may be determined by one or more of the other factors that vary with altitude.

Confounding doesn’t just occur in observational studies. Confounding occurs when the administration of a treatment itself generates other unwanted effects—this is called procedural confounding. An example might be in the administration of nutrients to plants. Changing the supply of nitrogen may be done by supplying different levels of a nitrate (\(\mathrm{NO_3^-}\)) salt (e.g. \(\mathrm{Mg(NO_3)_2}\) or \(\mathrm{Ca(NO_3)_2}\)), but how can we be sure that the effects we see are a consequence of nitrogen addition, rather than effects of the magnesium or calcium cations?

17.5.2 Noise

Noise in the data can be generated by the same processes that generate confounding. The difference is that noise is generated even when the confounding factors don’t align with the treatments. So, going back to measuring growth rates in plants, if we were looking at growth rates of different subspecies of plant on a mountain, then we might find that we can get five samples from each different subspecies, but the samples are scattered across very different altitudes on the mountain. This will add variation to the estimates of growth rate—this is unwanted noise. On the other hand, if the subspecies grow predominantly at different altitudes, the variation due to altitude is confounded with the variation due to subspecies.

17.6 Dealing with confounding effects and noise

Confounding effects occur often in biological work and noise of some sort is always present. Techniques for dealing with such effects include: randomisation, blocking, experimental control, and additional treatments.

17.6.1 Randomisation

Randomisation is fundamental to experimental design. Although we can identify specific confounding factors and explicitly counter them using experimental techniques, we can never anticipate all such factors. Randomisation provides an ‘insurance’ against the unpredictable confounding effects encountered in experiments. The basic principle is that each experimental unit should be selected, or allocated to a particular treatment, at random. This may involve selecting at random which patients receive a drug and which a placebo, or setting out experimental plots at random locations in a field. The important thing is that the units that receive a particular treatment are randomly selected from all the possible patients or plots.

Randomisation guards against a variety of possible biases and confounding effects, including the inadvertent biases that might be introduced simply in the process of setting up an experiment. For example, if in a toxicological experiment the chemical treatment is set up first and then the control, it may be that the animals caught most easily from the stock tank (the largest? the weakest?) will all end up in the chemical treatment and the remainder in the control, with consequent bias in the death rates observed in the subsequent experiment.

Randomisation is a critical method for guarding against confounding effects. It is the best insurance we have against unwittingly getting another factor working parallel to a treatment. It does not, of course, do anything to reduce noise in the data. In fact, if randomisation removes confounding effectively, it can appear to increase that variation—but it is a necessary cost to pay for being able to interpret treatment effects correctly.

What does ‘at random’ mean?

The ‘random’ bit of the word randomisation has a specific meaning: objects chosen ‘at random’ are chosen independently and with equal probabilities. How do we achieve this in practice? First, we need a set of random numbers. For example, if we need to assign 10 experimental units to treatments, we might start with a set of random integers: 4, 3, 5, 8, 7, 1, 10, 9, 6, 2 (obtaining such a set is easy in R, e.g. with sample(1:10)).

Exactly how these numbers are used in setting up the experiment will depend on what is practical. For example, in the toxicological experiment, we might place animals in each of the test containers to be used for the experiment, number each container and then use the first half of the set of random numbers to randomly select half the containers to be the test and use the remainder as the controls.
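
A minimal sketch of this allocation in R (the container numbering is hypothetical): shuffle the container numbers with sample() and let the first half receive the test treatment while the rest serve as controls.

    # Ten numbered containers allocated at random to test and control treatments
    set.seed(42)                      # only to make the example reproducible
    shuffled <- sample(1:10)          # container numbers in random order
    test_containers    <- shuffled[1:5]
    control_containers <- shuffled[6:10]

    # Equivalently, shuffle the treatment labels and pair them with containers
    treatment <- sample(rep(c("test", "control"), each = 5))
    data.frame(container = 1:10, treatment = treatment)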

17.6.2 Blocking

Another way of tackling potential confounding effects, and the general heterogeneity of biological material leading to noise, is to organise experimental material into ‘blocks’. This technique, called blocking, is arguably the most important experimental design concept after replication. It works as follows:

  1. Group the objects being studied into blocks such that variation among objects within blocks is small; variation between blocks may be larger.

  2. Each treatment should occur at least once within each block14.

For example, in an experiment in which mice are reared on three different diets (I, II, III), we might expect the responses of mice from within a particular litter to be fairly similar to each other, but they might be rather different to the responses of mice from different litters. If we have five litters of mice (A … E) it would be sensible to select three mice from each litter (at random) to be allocated to each treatment.

Diet    Litter A     Litter B     Litter C     Litter D     Litter E
I       \(A_{1}\)    \(B_{1}\)    \(C_{1}\)    \(D_{1}\)    \(E_{1}\)
II      \(A_{2}\)    \(B_{2}\)    \(C_{2}\)    \(D_{2}\)    \(E_{2}\)
III     \(A_{3}\)    \(B_{3}\)    \(C_{3}\)    \(D_{3}\)    \(E_{3}\)

Here, \(A_{1}\) denotes the first randomly chosen animal from litter \(A\), \(A_{2}\) the second randomly chosen animal from litter \(A\), and \(A_{3}\) the third. Blocking the design like this will generally increase the power of the experiment to detect effects of the treatments whenever there are real differences between litters.

In the case of only two treatments (e.g. if we just had diets I and II), this type of blocking is simply the pairing of treatments we have encountered in the paired-sample t-test. Blocked designs with more than two treatments are typically analysed using Analysis of Variance (ANOVA). We will learn how to apply ANOVA to a blocked experimental design in later chapters.

Note that randomisation is important here also. Mice were selected at random from each litter to be allocated to each treatment, and litters are essentially ‘random’ in the sense that they are not deliberately chosen to be different in any particular way. We just anticipate that they are likely to be different in some ways.
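
A small sketch of this within-litter randomisation in R (the litter labels, mouse labels and the eventual response variable are hypothetical): within each of the five litters, the three diets are assigned to three mice in a random order.

    # Randomised complete block design: 5 litters (blocks) x 3 diets
    litters <- LETTERS[1:5]
    diets   <- c("I", "II", "III")

    set.seed(1)
    design <- do.call(rbind, lapply(litters, function(litter) {
      data.frame(litter = litter,
                 mouse  = paste0(litter, 1:3),
                 diet   = sample(diets))  # random diet order within each litter
    }))
    design

    # Once weight gains are recorded, the blocked analysis could take the form
    # aov(weight_gain ~ diet + litter, data = ...)  (covered in later chapters)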

Blocking crops up in all sorts of experimental (and non-experimental) study designs. Some examples are given below.

  • If plants in an experiment on soil water levels are being grown in pots on greenhouse benches, there may be differences in light or temperature at differing distances from the glass. Treatments could be blocked along the gradient—at each position on the bench we have one pot from each treatment. This way, every treatment is represented at each position along the gradient.

  • If a field experiment involving several treatments is set up in an environment known to have some spatial variation (e.g., different parts of a field, sections of a river, etc.) setting up one replicate of each treatment in blocks at different locations ensures that no one treatment ends up confounded by some environmental difference, and helps remove noise due to environmental effects in the final analysis.

  • An immune response is being tested using insects kept in a parallel set of laboratory cultures. There are insufficient insects from a single culture to run the whole experiment, so we could set up one replicate of each treatment using insects from each culture. The cultures would be the blocks. We are not interested in the differences between cultures, but we want to be able to control and remove the variation due to differences between them.

  • If the process of collecting and analysing samples from an experiment is very time consuming, we might block the experiment in time. Set up one replicate of each treatment on each of a sequence of days, and then collect the samples after a particular time, again over the same sequence of days. Each replicate has then been run for the same length of time (we would randomise the order in which treatments were sampled each day), and we could then include ‘day’ as a block within the analysis to control for any unknown differences resulting from the different set-up or sampling days.

It’s worth saying again: blocking is one of the most important experimental design concepts. Many experimental settings lend themselves to some kind of blocking scheme. If there is a way to block an experiment, we should do it. Why? Because a study is more likely to detect an effect if it uses a blocked design compared to the equivalent non-blocked version.

17.6.3 Experimental control

Obviously, some unwanted variation in data will arise if there is poor measurement or careless implementation of the treatments. In every study we do, we should look at the ‘protocol’ issues and see if they can be improved. This means considering the precision of the measurements we are making in relation to the size of effects we are interested in. There would be no point in timing measurement intervals over which seedling growth was determined to the millisecond, but it would be good to measure seedling height using a standard approach and to the nearest millimetre, rather than centimetre.

The second form of experimental control is where we can use experimental manipulation of some sort to control for factors that might vary among replicates or treatments. At its simplest, this involves controlling the other conditions (e.g. temperature) so that all treatments experience identical conditions. It may not always be necessary for the conditions to be constant—it may be sufficient that whatever variation occurs is the same for all treatments.

More complex problems arise where the unwanted variation is produced as a by-product of the treatment we administer (procedural confounding again). Suppose we were interested in the effect of leaf litter decomposition on the microbial communities in soils. In that case, we might have an experimental treatment that involves varying the amount of leaf litter placed on the soil surface in test plots. The problem is that this varies not just the amount of decomposing material entering the soil; the physical presence of the leaf litter layer will also affect the microclimate at the soil surface (e.g. how dry the soil surface is). So we might create some sort of artificial litter that can be mixed in with the real litter but does not decompose, so that each plot has a constant volume of ‘litter’ on the surface but a different amount of decomposing material entering the soil.

Other situations in which this type of experimental ‘adjustment’ can be used include experiments in which different nutrient solutions have to be adjusted so that they have the same pH or where different temperature treatments have to have humidity adjusted to ensure that it remains constant. In general, this type of approach can be very useful, but it depends on the necessary adjustment being known and sometimes requires continuous monitoring to keep the adjustments correct.

17.6.4 Additional treatments: ‘designing in’ unwanted variation

Often we are faced with situations in which the unwanted variation — in particular confounding effects — cannot be removed by manipulating the treatments themselves, but has to be tackled by creating additional treatments whose function is to measure the extent of the unwanted variation, allowing us to remove it statistically from the data after the experiment is done. In other words, instead of just designing the experiment around the factor we are interested in, we ‘design in’ the sources of unwanted variation.

17.6.4.1 Transplants and cross-factoring

Imagine we had an investigation that involved looking at the effects of air pollution on the ability of trees to defend themselves chemically against attack by leaf-mining insects. The obvious thing to do would be to look at trees along a gradient of air pollution and monitor leaf damage by the insects. We might find that the insects attack the trees more in polluted areas. However, the problem here is that the trees growing in areas of high air pollution might be attacked more because they are stressed and less able to invest resources in defending themselves (as hypothesised), or because the insects’ own natural enemies are less abundant in areas of high air pollution, leading to reduced suppression. One way of escaping this confounding effect would be to take tree saplings from polluted and unpolluted areas and do reciprocal transplants—moving trees from polluted areas into clean areas, and vice versa. This enables us to separate out, to a large extent, the effect of tree quality from the effect of insect abundance, because we can compare trees that have grown with and without air pollution in both polluted and unpolluted areas.

It is also possible that by careful choice of location, or other elements of design, we can include the unwanted variation as an additional factor in the design without necessarily physically manipulating the subjects, but by sampling material systematically with regard to both the thing we are interested in and the additional unwanted factor(s), so that we can cross-factor the two. For example, suppose we were interested in how habitat use determines gut parasite load in dogfish. In that case, we might sample dogfish from different habitats and record the sex, age, or size of the fish. It would then be possible to separate the effects of sex or age from those of where the fish were living. If we didn’t do this, then both factors would probably contribute unwanted variation or confounding effects (e.g. male and female dogfish have somewhat different habitat preferences).

17.6.4.2 Procedural controls

Confounding effects are not only a problem along natural gradients. The experimental procedures can also introduce them. For example, a marine biologist investigating the effect of crab predation on the density of bivalve molluscs in an estuarine ecosystem might have cages on the mudflats from which crabs are removed and in which any change in bivalve settlement and survival can be monitored. The obvious control for this would be equivalent plots on the adjacent mudflats with normal crab numbers. Suppose the experiment just compares the bivalve density in cages with reduced crab numbers to their density in the adjacent mudflat. In that case, any effects observed could be attributable to crab density, environmental changes brought about by the cages, or disturbance due to the repeated netting to remove crabs. Several additional controls might be useful here. In addition to the proper treatment, bivalve density could be monitored in:

  • a ‘no cage / no disturbance control’—open mudflat adjacent to the experiment (so no cage effects, no added disturbance).

  • a ‘cage control’—crabs at normal density but with a cage (usually done as a cage with openings to allow crabs to enter and leave).

  • a ‘disturbance control’—crabs at normal densities, but subject to the same disturbance as the reduced-density treatments (cages netted to remove crabs, but all crabs returned to the cages).

The latter two could be combined if it wasn’t important to separate disturbance and cage effects.

The additional treatments in this sort of situation are effectively additional controls—in fact, they are termed procedural controls—but they are not simply the natural ‘background’ conditions. A classic example of this type of control is the use of placebo treatments in medical trials. For example, if we are investigating the effect of a drug, there may be a confounding effect due to psychological, behavioural or even physiological changes in patients resulting simply from being treated, rather than any active compound in the drug. Therefore, it is common to give the drug to one group of patients and a ‘placebo’ to another group. The placebo is a secondary manipulation designed to equalise the effect of simply ‘being treated’ (i.e. equivalent treatment process, but with no active component in the substance administered).

17.7 Ethics and practicality

Although experimental design is often fairly straightforward in principle, the ideal design to test a hypothesis may turn out to be impractical, unaffordable or unethical. All experiments are constrained by practicality and finance, and ethical considerations constrain a smaller, but important, set. Ethical issues arise in every biological discipline, but nowhere are they more pronounced than in medicine.

Drug testing presents the classic difficulty. Effective testing of a drug’s efficacy depends on comparing patients receiving the drug with closely equivalent patients not receiving it, or receiving some alternative treatment. Since one of the treatments will likely be better than another, at least one group of patients, by definition, has an available and better treatment withheld from them (e.g. Aspinal and Goodman 1995). Thus, as soon as the experimental evidence indicates which treatment is best, it is very hard to justify withholding it from any of the patients, even if the experimenter feels that further work is necessary.

Good experimental design and appropriate analysis cannot remove ethical, practical or financial problems. Still, they can help ensure that the maximum useful information is returned where time and money are invested in a problem.


  1. It is worth noting that, in reports, experiment and observation should always be distinguished. If we have carried out observations on a natural system of any sort, but there has been no experimental manipulation of any aspect of the system, that is not an experiment. It would be inappropriate to write in a report: “This experiment consisted of measuring mean stomatal density from thirty trees growing at a range of altitudes.” Instead, we might write: “We conducted an observational study measuring mean stomatal density from thirty trees growing at a range of altitudes.”↩︎

  2. This variability could be due to all kinds of things: variability of the organisms or material being used, of the experimental conditions, and of the methods of measurement.↩︎

  3. Actually, there are special types of experimental design that use blocking, but where each treatment does not appear in every block. These are much more advanced than anything we will cover in this book.↩︎