Module 03 of 8 · 55 min read · Advanced

Designing an RCT

Power calculations and minimum detectable effects, the unit of randomisation, stratification, and the pre-analysis plan.

Learning objectives

By the end of this module, you should be able to:

  • Explain statistical power and the minimum detectable effect
  • Choose the unit of randomisation given spillovers
  • Explain stratification and the role of baseline data
  • Explain the pre-analysis plan and why it matters

An RCT is only as good as its design. A poorly designed trial can be worse than none: it wastes resources and can produce a misleading or uninformative result. This module covers the key design decisions: how big the sample must be (power), what to randomise (the unit), how to improve precision (stratification), and how to guard against fishing for results (the pre-analysis plan).

Power and sample size

Can the trial detect the effect?

Statistical power is the probability that the trial detects a true effect of a given size, that is, rejects the null hypothesis when it is false. An underpowered trial (one with too small a sample to reliably detect the effect) is a common and serious failure: it is likely to produce a non-significant result EVEN IF the treatment works, wasting the whole exercise and risking a false 'it doesn't work' conclusion. The related concept is the minimum detectable effect (MDE): the smallest true effect the trial could detect with the chosen power and significance level, given its sample size. Power and the required sample size depend on the expected effect size (smaller effects need bigger samples), the variance of the outcome (noisier outcomes need bigger samples), the significance level (conventionally 5%), and the desired power (conventionally 80%). The practical lesson: calculate power BEFORE running the trial, make sure the sample is large enough to detect a policy-relevant effect, and be honest that detecting small effects requires large (expensive) samples. A trial that can't detect a plausible effect shouldn't be run.
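To make the relationship between effect size, variance, and sample size concrete, here is a minimal sketch of a power calculation. It assumes a simple two-arm trial with individual randomisation, a two-sided test, and the standard normal approximation; the function names are illustrative, not taken from any particular library.

```python
from math import ceil, sqrt
from statistics import NormalDist


def _multiplier(alpha: float, power: float) -> float:
    """Critical value of a two-sided test at `alpha` plus the z-value for the target power."""
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def minimum_detectable_effect(n_total, sd, alpha=0.05, power=0.80, share_treated=0.5):
    """Smallest true effect detectable with the given power (normal approximation)."""
    p = share_treated
    return _multiplier(alpha, power) * sd * sqrt(1 / (p * (1 - p) * n_total))


def required_sample_size(effect, sd, alpha=0.05, power=0.80, share_treated=0.5):
    """Total sample (both arms) needed to detect `effect` with the given power."""
    p = share_treated
    return ceil((_multiplier(alpha, power) * sd / effect) ** 2 / (p * (1 - p)))


if __name__ == "__main__":
    # Effects are expressed in standard deviations of the outcome (sd = 1).
    for effect in (0.4, 0.2, 0.1):
        print(f"effect = {effect} sd -> total sample of about {required_sample_size(effect, sd=1.0)}")
```

Under these assumptions, halving the effect you want to detect roughly quadruples the required sample, which is why small effects are expensive to detect.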

The unit of randomisation

Individual or cluster?

What do you randomise: individuals, or groups (villages, schools, clinics)? The choice is driven by SPILLOVERS. If the treatment can spill over from treated to untreated individuals WITHIN a group (deworming reduces infection for untreated children too; an information campaign spreads by word of mouth; a cash transfer affects the local economy), then randomising individuals within a village would CONTAMINATE the control group: the untreated are affected by their treated neighbours, which biases the estimate. The solution is CLUSTER randomisation: randomise whole groups (treat entire villages, leave others untreated), so that spillovers stay within treated clusters and the control clusters are clean. The cost is that cluster randomisation is statistically less efficient. Outcomes within a cluster are correlated (the intra-cluster correlation), so 100 people in one cluster provide less information than 100 independent individuals. This 'design effect' shrinks the effective sample size, so a clustered trial needs MORE total units to reach the same power. The unit-of-randomisation choice therefore trades off contamination from spillovers (favouring clusters) against statistical efficiency (favouring individuals), and getting it wrong (randomising individuals when spillovers exist) biases the result.
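The efficiency cost of clustering can be put into numbers with the usual design-effect formula, 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intra-cluster correlation. The sketch below assumes equal-sized clusters; the function names and the ICC value are illustrative assumptions.

```python
def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation from clustering, assuming equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_total: int, cluster_size: int, icc: float) -> float:
    """Number of independent observations the clustered sample is 'worth'."""
    return n_total / design_effect(cluster_size, icc)


if __name__ == "__main__":
    # One cluster of 100 people with a hypothetical ICC of 0.05.
    deff = design_effect(cluster_size=100, icc=0.05)
    n_eff = effective_sample_size(n_total=100, cluster_size=100, icc=0.05)
    print(f"design effect = {deff:.2f}, effective sample size = {n_eff:.1f}")
```

With these illustrative values the design effect is 5.95, so the 100 clustered observations carry about as much information as 17 independent ones. When the ICC is zero, or each cluster holds a single unit, the design effect is 1 and nothing is lost.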

Stratification and baseline data

Two design tools improve a trial. Stratification (blocking): before randomising, divide the sample into strata (by region, gender, baseline outcome) and randomise WITHIN each stratum. This guarantees balance on the stratifying variables, rather than relying on chance to balance them, and improves the precision of the estimate. It is especially valuable in small samples, where chance imbalance is a real risk. Baseline data: collecting outcomes BEFORE the treatment lets you (a) verify balance (check that the randomisation worked, so the groups are similar at baseline), (b) increase precision (controlling for the baseline outcome, or using a difference-in-differences or ANCOVA specification, can sharply reduce noise when the outcome is persistent over time), and (c) study heterogeneous effects (does the treatment work better for some baseline groups?). Good baseline data and sensible stratification are cheap ways to make a trial more informative.
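Randomising within strata can be sketched in a few lines. The example below is one simple way to do it, not a prescribed procedure: it assumes a 50/50 split, assigns the leftover unit in an odd-sized stratum by coin flip, and uses a fixed seed so the assignment can be reproduced. The function name, school identifiers, and regions are hypothetical.

```python
import random
from collections import defaultdict


def stratified_assignment(units, stratum_of, seed=12345):
    """Randomly assign half of each stratum to treatment and the rest to control."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    by_stratum = defaultdict(list)
    for unit in units:
        by_stratum[stratum_of(unit)].append(unit)

    assignment = {}
    for stratum in sorted(by_stratum):  # sorted so the result does not depend on input order of strata
        members = by_stratum[stratum]
        rng.shuffle(members)
        n_treated = len(members) // 2
        # In an odd-sized stratum, a coin flip decides which arm gets the extra unit.
        if len(members) % 2 == 1 and rng.random() < 0.5:
            n_treated += 1
        for position, unit in enumerate(members):
            assignment[unit] = "treatment" if position < n_treated else "control"
    return assignment


if __name__ == "__main__":
    # Hypothetical sample: 20 schools, stratified by region.
    region = {f"school_{i:02d}": ("north" if i < 10 else "south") for i in range(20)}
    arms = stratified_assignment(list(region), region.get)
    for name in ("north", "south"):
        treated = sum(1 for s, arm in arms.items() if region[s] == name and arm == "treatment")
        print(name, "treated:", treated)
```

Because the shuffle happens inside each stratum, every region ends up with the same number of treated and control schools, which simple randomisation over the pooled sample would not guarantee.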

The pre-analysis plan

Tying your hands against fishing

A pre-analysis plan (PAP) specifies IN ADVANCE, in a public registration made before seeing the outcome data, exactly what hypotheses you will test, which outcomes are primary, which subgroups you'll examine, and how you'll analyse the data. Why it matters: without it, a researcher facing many possible outcomes, subgroups, and specifications can (consciously or not) FISH, trying many analyses and reporting the ones that 'work' (p-hacking, specification search). This produces false positives that don't replicate (the replication/credibility crisis covered in the Measurement course). With many tests, some will be significant by chance alone. The PAP ties the researcher's hands: by committing to the analysis before seeing the data, it prevents cherry-picking and makes the resulting p-values meaningful (a pre-specified test that passes is real evidence; a post-hoc test discovered by searching is not). Pre-registration is now standard practice for credible RCTs and is part of the broader transparency-and-reproducibility movement (the Measurement course's final module). It distinguishes confirmatory analysis (pre-specified, credible) from exploratory analysis (hypothesis-generating, to be confirmed later). Both are valuable, but they must be honestly labelled.
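The claim that some tests will be significant by chance alone is easy to quantify. The sketch below assumes the tests are independent and that every null hypothesis is true; the Bonferroni threshold is shown as one common (and conservative) correction, not as something the module prescribes. The function names are illustrative.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests when no effect exists."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise error rate at or below `alpha`."""
    return alpha / n_tests


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(
            f"{k:>2} tests: P(at least one false positive) = {familywise_error_rate(k):.2f}, "
            f"Bonferroni per-test threshold = {bonferroni_threshold(k):.4f}"
        )
```

Under these assumptions, a researcher who tries 20 outcomes or subgroups has roughly a 64% chance of finding at least one 'significant' result even when the treatment does nothing. Pre-specifying a small number of primary outcomes keeps that number of tests, and hence this risk, under control.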

Exercise

A government wants to evaluate a new teacher-training programme via RCT, measuring effects on student test scores. The training, if effective, would spread teaching practices among teachers in the same school. (1) Should it randomise individual teachers or whole schools, and why? (2) Explain the cost of that choice for sample size and power. (3) The expected effect is small; explain the power implication and one way to improve precision. (4) Explain why it should publish a pre-analysis plan, and what could go wrong without one.

Key takeaways

  • Statistical power is the probability of detecting a true effect; underpowered trials likely return non-significant results even when the treatment works — calculate power (and the minimum detectable effect) before running
  • The unit of randomisation is driven by spillovers: if treatment spills over within groups, randomise CLUSTERS (whole villages/schools) so that spillovers stay inside treated clusters and control clusters remain uncontaminated. The cost is efficiency: intra-cluster correlation shrinks the effective sample, so more clusters are needed
  • Stratification (randomise within strata) guarantees balance and improves precision; baseline data verifies balance, sharply cuts noise (control for the baseline outcome), and enables heterogeneity analysis
  • A pre-analysis plan specifies the hypotheses and analysis in advance (registered before seeing the data) — tying the researcher's hands against p-hacking/specification search, making p-values meaningful
  • Pre-registration distinguishes confirmatory (pre-specified, credible) from exploratory (hypothesis-generating) analysis — the response to the replication crisis

Further reading

  1. Running Randomized Evaluations: A Practical Guide
     Rachel Glennerster & Kudzai Takavarasha · Princeton University Press · 2013
     The hands-on guide to designing and running an RCT: power, randomisation, and implementation. The practitioner's manual.

  2. Using Randomization in Development Economics Research: A Toolkit
     Esther Duflo, Rachel Glennerster & Michael Kremer · Handbook of Development Economics · 2007
     The conceptual treatment of design choices: unit of randomisation, power, stratification. The reference.

  3. Pre-Analysis Plans and the Credibility of Research
     Benjamin Olken / Christensen & Miguel · Journal of Economic Perspectives / JEL · 2015
     Why pre-registration matters and how it works. The case for tying your hands.

LeadAfrik · Public Economics Hub