An RCT is only as good as its design, and a poorly designed trial can be worse than none — wasting resources and producing a misleading or uninformative result. This module covers the key design decisions: how big the sample must be (power), what to randomise (the unit), how to improve precision (stratification), and how to guard against fishing for results (the pre-analysis plan).
Power and sample size
Can the trial detect the effect?
Statistical power is the probability that the trial will detect a true effect of a given size, that is, reject the null hypothesis of no effect when it is false. An underpowered trial (too small a sample to reliably detect the effect) is a common and serious failure: it is likely to produce a non-significant result EVEN IF the treatment works, wasting the whole exercise and risking a false 'it doesn't work' conclusion. The related concept is the minimum detectable effect (MDE): the smallest true effect the trial would detect with the chosen power, given its sample size and significance level. Power and the required sample size depend on: the expected effect size (smaller effects need bigger samples), the variance of the outcome (noisier outcomes need bigger samples), the significance level (conventionally 5%), and the desired power (conventionally 80%). The practical lesson: calculate power BEFORE running the trial, ensure the sample is large enough to detect a policy-relevant effect, and be honest that detecting small effects requires large (expensive) samples, since halving the effect to be detected roughly quadruples the required sample. A trial that can't detect a plausible effect shouldn't be run.
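The dependence of sample size on effect size, variance, significance level and power can be made concrete with a rough calculation. The sketch below assumes the simplest possible design, which the text does not specify: two equal-sized arms, individual randomisation, a continuous outcome with a known standard deviation, and a two-sided test under the normal approximation. The function names `n_per_arm` and `mde` are illustrative and do not come from any library.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Approximate sample size per arm to detect `effect` (same units as `sd`).

    Assumes two equal arms, a two-sided test and the normal approximation.
    """
    z = NormalDist().inv_cdf
    multiplier = z(1 - alpha / 2) + z(power)
    return ceil(2 * (multiplier * sd / effect) ** 2)


def mde(n_arm, sd, alpha=0.05, power=0.80):
    """Minimum detectable effect for a given sample size per arm (same assumptions)."""
    z = NormalDist().inv_cdf
    multiplier = z(1 - alpha / 2) + z(power)
    return multiplier * sd * sqrt(2 / n_arm)


# Effects are expressed in standard deviations of the outcome (sd = 1).
print(n_per_arm(effect=0.2, sd=1.0))  # 393
print(n_per_arm(effect=0.1, sd=1.0))  # 1570, roughly four times as many
print(round(mde(n_arm=393, sd=1.0), 3))
```

Under these assumptions, halving the target effect from 0.2 to 0.1 standard deviations roughly quadruples the sample needed per arm. A real power calculation would also account for attrition, unequal arms, clustering and any covariates used in the analysis.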
The unit of randomisation
Individual or cluster?
What do you randomise: individuals, or groups (villages, schools, clinics)? The choice is driven by SPILLOVERS. If the treatment can spill over from treated to untreated individuals WITHIN a group (deworming reduces infection for untreated children too; an information campaign spreads by word of mouth; a cash transfer affects the local economy), then randomising individuals within a village would CONTAMINATE the control group (the untreated are affected by their treated neighbours), biasing the estimate (towards zero when the spillover benefits the control group, because the comparison group is then partly treated). The solution is CLUSTER randomisation: randomise whole groups (treat entire villages, leave others untreated), so the spillovers stay within treated clusters and the control clusters are clean. The cost: cluster randomisation is statistically less efficient. Outcomes within a cluster are correlated (intra-cluster correlation), so a cluster of 100 provides less information than 100 independent individuals, and far less when the correlation is high. This 'design effect' shrinks the effective sample size, so a clustered trial needs MORE total units to reach the same power. So the unit-of-randomisation choice trades off contamination from spillovers (favouring clusters) against statistical efficiency (favouring individuals), and getting it wrong (randomising individuals when spillovers exist) biases the result.
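The efficiency cost of clustering can be put into numbers with the standard design-effect formula, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intra-cluster correlation. The sketch below assumes equal-sized clusters; the ICC of 0.05 and the figure of 393 individuals per arm are illustrative values, not taken from the text, and the function names are hypothetical.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent individuals carrying the same information."""
    return n_total / design_effect(cluster_size, icc)


def clusters_per_arm(n_individual_per_arm, cluster_size, icc):
    """Clusters needed per arm to match an individually randomised design."""
    inflated_n = n_individual_per_arm * design_effect(cluster_size, icc)
    return ceil(inflated_n / cluster_size)


# Illustrative values: clusters of 100 people, ICC of 0.05.
deff = design_effect(cluster_size=100, icc=0.05)
print(round(deff, 2))
print(round(effective_sample_size(100, cluster_size=100, icc=0.05), 1))

# If an individually randomised trial needed 393 people per arm:
print(clusters_per_arm(393, cluster_size=100, icc=0.05))
```

With these illustrative numbers the design effect is about 6, so one cluster of 100 is worth roughly 17 independent individuals, and matching a 393-per-arm individual design takes 24 clusters (2,400 people) per arm. Because the design effect grows with cluster size, adding clusters usually buys more power than adding people within clusters.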
Stratification and baseline data
Two design tools make a trial more precise and more credible. Stratification (blocking): before randomising, divide the sample into strata (by region, gender, baseline outcome) and randomise WITHIN each stratum. This guarantees balance on the stratifying variables (rather than relying on chance to balance them) and improves the precision of the estimate, which is especially valuable in small samples, where chance imbalance is a real risk. Baseline data: collecting outcomes BEFORE the treatment lets you (a) verify balance (check that the randomisation worked and the groups are similar at baseline), (b) increase precision (controlling for the baseline outcome in an ANCOVA specification, or analysing the change from baseline, reduces noise, sharply so when the baseline outcome strongly predicts the later outcome), and (c) study heterogeneous effects (does the treatment work better for some baseline groups?). Good baseline data and sensible stratification are relatively cheap ways to make a trial more informative.
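Randomising within strata can be sketched in a few lines. The example below is a minimal illustration, assuming two arms of equal size and a single stratifying variable; the school names, the north/south strata and the function `stratified_assignment` are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_assignment(units, stratum_of, seed=0):
    """Assign units to 'treatment' or 'control' separately within each stratum.

    `units` is a list of hashable unit identifiers and `stratum_of` maps a
    unit to its stratum. A fixed seed makes the assignment reproducible.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    assignment = {}
    for key in sorted(strata):
        members = list(strata[key])
        rng.shuffle(members)
        n_treat = len(members) // 2
        # In an odd-sized stratum, a coin flip decides which arm gets the extra unit.
        if len(members) % 2 == 1 and rng.random() < 0.5:
            n_treat += 1
        for position, unit in enumerate(members):
            assignment[unit] = "treatment" if position < n_treat else "control"
    return assignment


# Hypothetical sample: 6 schools in the north, 4 in the south.
schools = {f"school_{i}": ("north" if i < 6 else "south") for i in range(10)}
assignment = stratified_assignment(list(schools), schools.get, seed=42)

for region in ("north", "south"):
    treated = [s for s, arm in assignment.items()
               if schools[s] == region and arm == "treatment"]
    print(region, len(treated))
```

Whatever the seed, exactly half of each region ends up treated, which simple randomisation across all ten schools would not guarantee. When strata are used in the design, the analysis would normally account for them as well, for example by including stratum indicators.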
The pre-analysis plan
Tying your hands against fishing
A pre-analysis plan (PAP) specifies, IN ADVANCE and in a public registry (before seeing the outcome data), exactly what hypotheses you will test, what outcomes are primary, what subgroups you'll examine, and how you'll analyse the data. Why it matters: without it, a researcher facing many possible outcomes, subgroups, and specifications can (consciously or not) FISH, that is, try many analyses and report the ones that 'work' (p-hacking, specification search), producing false positives that don't replicate (the replication/credibility crisis; see the Measurement course). With many tests, some will be significant by chance alone: at the 5% level, 20 independent tests of true null hypotheses yield one 'significant' result on average. The PAP ties the researcher's hands: by committing to the analysis before seeing the data, it prevents cherry-picking and makes the resulting p-values meaningful (a pre-specified test that passes is credible evidence; a post-hoc result found by searching is at best suggestive). Pre-registration is now standard practice for credible RCTs and is part of the broader transparency-and-reproducibility movement (the Measurement course's final module). It distinguishes confirmatory analysis (pre-specified, credible) from exploratory analysis (hypothesis-generating, to be confirmed later). Both are valuable, but they must be honestly labelled.
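How quickly chance findings accumulate can be shown with a short calculation. The sketch below assumes that every null hypothesis is true and that the tests are independent, which is a simplification: outcomes and subgroups in a real trial are usually correlated, so the numbers are indicative only. The function names are illustrative.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of `n_tests` independent tests of true nulls is significant."""
    return 1 - (1 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha=0.05):
    """Average number of significant results when every null is true."""
    return n_tests * alpha


for n_tests in (1, 5, 20, 50):
    print(
        n_tests,
        round(prob_any_false_positive(n_tests), 3),
        round(expected_false_positives(n_tests), 2),
    )
```

Under these assumptions, a researcher who tries 20 outcome and subgroup combinations has roughly a 64% chance of finding at least one result that is 'significant' at the 5% level even if the treatment does nothing. Naming a small set of primary outcomes in advance is what keeps the stated significance level honest.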
Exercise
A government wants to evaluate a new teacher-training programme via RCT, measuring effects on student test scores. The training, if effective, would spread teaching practices among teachers in the same school. (1) Should it randomise individual teachers or whole schools, and why? (2) Explain the cost of that choice for sample size and power. (3) The expected effect is small; explain the power implication and one way to improve precision. (4) Explain why it should publish a pre-analysis plan, and what could go wrong without one.