The course ends with the hardest step: moving from a clean causal estimate to an actual policy decision. A perfectly internally valid RCT answers 'did this work HERE?', but policy needs 'will this work THERE, at SCALE, and is it the BEST use of money?' These are different and harder questions, and treating an internally valid estimate as a policy conclusion is a common and serious error. This module covers external validity, scale-up, cost-effectiveness, and the limits of the experimental approach.
Internal versus external validity
Works here vs works there
Internal validity: does the study correctly estimate the causal effect FOR THE STUDIED POPULATION AND CONTEXT? (The RCT's strength — randomisation delivers this.) External validity (generalisability): does the result HOLD in OTHER populations, contexts, times, or at scale? (The RCT does NOT automatically deliver this.) An intervention that worked in a randomised trial in one district, with one implementer, at one time, on one population may NOT work elsewhere — because the population differs (different needs, constraints), the context differs (different institutions, markets, complementary conditions), the implementer differs (an NGO's careful pilot vs a government's stretched bureaucracy), or the effect simply varies. The evidence (Vivalt) is that effect sizes VARY substantially across studies of the same intervention — generalisation is genuinely uncertain. So a credible internal estimate is NECESSARY but NOT SUFFICIENT for a policy decision: you must ask whether the conditions that made it work here will hold there. The leap from internal to external validity is where most policy mistakes happen — taking a result that's true HERE and assuming it's true everywhere.
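One way to make the 'here versus there' question concrete is to ask what the average effect would be if the same subgroup effects were applied to a population with a different composition. The Python sketch below does this under a strong and purely illustrative assumption: subgroup effects transfer unchanged and only the population mix differs. The subgroup names, effects, and shares are all hypothetical, not taken from any study.

```python
# Hypothetical subgroup effects from a trial, in outcome units.
# ASSUMPTION: effects are stable within each subgroup; only the mix of
# subgroups differs between the study site and the target site.
subgroup_effects = {"near_market": 8.0, "remote": 2.0}

# Hypothetical population shares (each set must sum to 1).
study_shares = {"near_market": 0.75, "remote": 0.25}
target_shares = {"near_market": 0.25, "remote": 0.75}


def average_effect(effects, shares):
    """Share-weighted average of subgroup effects."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(effects[group] * shares[group] for group in effects)


effect_here = average_effect(subgroup_effects, study_shares)
effect_there = average_effect(subgroup_effects, target_shares)

print(f"Average effect in the study population:  {effect_here:.2f}")
print(f"Implied effect in the target population: {effect_there:.2f}")
```

With these made-up numbers the average effect falls from 6.5 in the study population to 3.5 in the target population, even though nothing but composition has changed. If the context or the implementer also alters the subgroup effects themselves, reweighting alone will not recover the effect 'there'.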
The scale-up problem
Why pilots don't always scale
A specific and crucial external-validity problem: an intervention that works as a small PILOT may not work at SCALE, for several reasons. (1) General-equilibrium effects: a job-training programme that helps a few trainees get jobs may not help if EVERYONE is trained, because they compete for the same limited jobs (the displacement spillover of module 4, now at scale); likewise, a small cash transfer may not move prices, but a large one might raise local prices and erode the benefit. The partial-equilibrium effect (a few treated) differs from the general-equilibrium effect (everyone treated). (2) Implementation and delivery: a pilot run by a motivated NGO with careful oversight may work, while the same programme delivered by an overstretched government bureaucracy at national scale fails (the implementation gap, the state-capacity problem of the Governance course). (3) Selection of context: pilots are often run in favourable settings (a willing district, a capable partner), so scaling to average or hard settings dilutes the effect. (4) Market and behavioural responses that appear only at scale, because a small pilot gives suppliers, employers, and non-participants little reason to change what they do. So 'it worked in the RCT' does NOT mean 'it will work nationally'. The scale-up problem (Banerjee et al.; the 'last mile' of evidence-based policy) is one of the central challenges of translating evidence into policy, and ignoring it by scaling a pilot naively is a common, costly mistake. Anticipating general-equilibrium and implementation effects is essential before scaling.
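The displacement argument in point (1) can be sketched with a deliberately stylised model. The Python example below assumes a fixed number of jobs shared out in proportion to each worker's hiring weight, with training doubling that weight. The function name, the hiring rule, and every number are hypothetical simplifications chosen for illustration, not estimates from any programme.

```python
def employment_rates(n_workers, n_jobs, n_trained, skill_weight):
    """Return (p_trained, p_untrained), the expected employment probabilities.

    Stylised ASSUMPTION: a fixed stock of jobs is shared out in proportion
    to hiring weight (1 for untrained workers, skill_weight for trained).
    Probabilities are capped at 1; the cap does not bind for the numbers
    used below.
    """
    n_untrained = n_workers - n_trained
    total_weight = n_trained * skill_weight + n_untrained
    jobs_per_unit_weight = n_jobs / total_weight
    p_trained = min(1.0, skill_weight * jobs_per_unit_weight)
    p_untrained = min(1.0, jobs_per_unit_weight)
    return p_trained, p_untrained


# Hypothetical labour market: 1,000 workers competing for 200 jobs.
N_WORKERS, N_JOBS, SKILL_WEIGHT = 1000, 200, 2.0
baseline_rate = N_JOBS / N_WORKERS  # employment rate with no programme

for n_trained in (10, 500, 1000):
    p_trained, p_untrained = employment_rates(
        N_WORKERS, N_JOBS, n_trained, SKILL_WEIGHT
    )
    gain_vs_no_programme = p_trained - baseline_rate
    total_employed = (
        n_trained * p_trained + (N_WORKERS - n_trained) * p_untrained
    )
    print(
        f"trained={n_trained:4d}  "
        f"P(job | trained)={p_trained:.3f}  "
        f"gain vs no programme={gain_vs_no_programme:+.3f}  "
        f"total employed={total_employed:.0f}"
    )
```

With these made-up numbers a trainee's employment probability rises from 0.20 to about 0.40 when 10 of 1,000 workers are trained, but stays at 0.20 when all 1,000 are trained, and total employment is 200 in every case. The pilot-scale gain is real for the trainee yet comes entirely from displacing others, which is why it vanishes at full scale in this model.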
Cost-effectiveness
Even an intervention that works (and generalises) may not be the BEST use of scarce funds: the policy question is comparative. Cost-effectiveness analysis ranks interventions by their cost per unit of outcome (cost per additional year of schooling, per case of disease averted, per life saved, per dollar of income gained). This is what lets a budget-constrained policymaker choose AMONG proven interventions, and the results can be startling: J-PAL and GiveWell-style cost-effectiveness comparisons found, for example, that deworming or providing information can deliver vastly more education per dollar than many more-expensive interventions, and that some popular programmes are far less cost-effective than alternatives. Cost-effectiveness analysis, which leaves outcomes in their natural units, is often preferred to cost-BENEFIT analysis (the CBA course), which monetises benefits, in social sectors where monetising outcomes is hard. The key insight: 'it works' is not enough; the policy-relevant question is 'does it deliver more per dollar than the alternatives?' The evidence base is therefore most useful when interventions are compared on cost-effectiveness, not just evaluated one at a time. This connects impact evaluation to the cost-benefit and MVPF (marginal value of public funds) tools of the Public Finance area: the goal is the best welfare per dollar, and rigorous effect estimates are an input to that.
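The ranking step can be sketched in a few lines of Python. The example below computes cost per additional year of schooling for three interventions and sorts them from most to least cost-effective. The programme names, costs, and effects are hypothetical placeholders, not figures from J-PAL, GiveWell, or any real evaluation, and the `Intervention` class is an illustrative helper rather than an established tool.

```python
from dataclasses import dataclass


@dataclass
class Intervention:
    name: str
    cost_per_child: float      # programme cost per child, in dollars
    extra_school_years: float  # estimated effect per child, in years

    def __post_init__(self):
        # The ratio is only meaningful for a strictly positive effect.
        if self.extra_school_years <= 0:
            raise ValueError("effect must be positive to compute a ratio")

    @property
    def cost_per_extra_year(self) -> float:
        """Dollars spent per additional year of schooling."""
        return self.cost_per_child / self.extra_school_years


# HYPOTHETICAL numbers, chosen only to illustrate the ranking.
interventions = [
    Intervention("Programme A", cost_per_child=5.0, extra_school_years=0.10),
    Intervention("Programme B", cost_per_child=200.0, extra_school_years=0.50),
    Intervention("Programme C", cost_per_child=60.0, extra_school_years=0.20),
]

# Lowest cost per unit of outcome first.
ranked = sorted(interventions, key=lambda item: item.cost_per_extra_year)

for item in ranked:
    print(f"{item.name}: ${item.cost_per_extra_year:,.0f} per extra school year")
```

In this made-up example Programme B has the largest effect per child but the highest cost per extra year, which is the sense in which 'it works' and 'it is the best buy' can diverge. A fuller analysis would also carry the uncertainty in each effect estimate through to the ratio.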
The RCT critique and responsible use
Deaton-Cartwright and using evidence well
The experimental approach has powerful critics, and engaging them is part of using it responsibly. Angus Deaton and Nancy Cartwright ('Understanding and misunderstanding randomized controlled trials', 2018) argue: (1) internal validity doesn't guarantee policy relevance (an unbiased estimate of an effect HERE tells you little about THERE without a theory of WHY it worked); (2) RCTs estimate an AVERAGE effect that may hide important heterogeneity and may apply to no actual individual; (3) without understanding MECHANISMS (why the intervention works), you can't predict whether it will transfer, so RCTs need THEORY, not just experiments; and (4) the 'gold standard' framing can crowd out other valuable evidence (structural models, observational studies, qualitative understanding) and important non-experimentable questions (you can't randomise a currency regime or an institution). The constructive response, which the field has largely absorbed, has four parts: treat RCTs as a powerful tool, not the only one; combine experimental effect estimates with THEORY and MECHANISM (why does it work?), with cost-effectiveness comparison, and with judgement about external validity and scale-up; use the evidence base (J-PAL and what has been settled: CCTs, deworming, the limited impact of microcredit) as INPUTS to policy reasoning, not as automatic prescriptions; and retain humility about generalisation. The mature position is neither RCT-worship nor RCT-rejection but disciplined eclecticism: rigorous causal evidence, interpreted through theory and mechanism, compared on cost-effectiveness, and applied with explicit attention to external validity, scale, and context. That disciplined, humble use of evidence is the real lesson of the course, and the bridge from this methods specialization back to the substantive policy questions of the whole program.
Exercise
An RCT in one district finds that a new agricultural-extension programme (training farmers in better techniques) raised participating farmers' yields by 30% and did so cost-effectively. The agriculture minister wants to scale it nationally immediately. (1) Explain why the internally valid 30% estimate is not sufficient to justify national scale-up. (2) Identify the general-equilibrium and implementation risks of scaling. (3) Explain how cost-effectiveness should inform the decision. (4) Apply the Deaton-Cartwright critique to advise the minister on using this evidence responsibly.