Module 08 of 8 · 50 min read · Advanced

From estimate to policy

External validity, the scale-up problem, cost-effectiveness comparison, and using the J-PAL evidence base responsibly.

Learning objectives

By the end of this module, you should be able to:

  • 01 Distinguish internal from external validity
  • 02 Explain the scale-up problem and general-equilibrium effects
  • 03 Use cost-effectiveness to compare interventions
  • 04 Apply the RCT critique and use the evidence base responsibly

The course ends with the hardest step: from a clean causal estimate to an actual policy decision. An RCT with perfect internal validity answers 'did this work HERE?', but policy needs 'will this work THERE, at SCALE, and is it the BEST use of money?' These are different and harder questions, and treating an internally valid estimate as a policy conclusion is a common and serious error. This module covers external validity, scale-up, cost-effectiveness, and the limits of the experimental approach.

Internal versus external validity

Works here vs works there

Internal validity: does the study correctly estimate the causal effect FOR THE STUDIED POPULATION AND CONTEXT? This is the RCT's strength, and randomisation delivers it. External validity (generalisability): does the result HOLD in OTHER populations, contexts, times, or at scale? The RCT does NOT automatically deliver this. An intervention that worked in a randomised trial in one district, with one implementer, at one time, on one population may NOT work elsewhere, because the population differs (different needs, constraints), the context differs (different institutions, markets, complementary conditions), the implementer differs (an NGO's careful pilot vs a government's stretched bureaucracy), or the effect simply varies from one setting to the next. The evidence (Vivalt) is that effect sizes VARY substantially across studies of the same intervention, so generalisation is genuinely uncertain. A credible internal estimate is therefore NECESSARY but NOT SUFFICIENT for a policy decision: you must ask whether the conditions that made it work here will hold there. The leap from internal to external validity is where many policy mistakes happen: taking a result that is true HERE and assuming it is true everywhere.
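One way to make "effect sizes vary across studies" concrete is a random-effects summary, which separates sampling noise from genuine between-study variation. The sketch below uses the DerSimonian-Laird estimator, a standard meta-analysis technique chosen here for illustration and not necessarily the method used in the cited work. The five effect sizes and standard errors are HYPOTHETICAL, and the 1.96 multiplier is a normal approximation.

```python
import math


def random_effects_summary(effects, std_errors):
    """DerSimonian-Laird random-effects summary of study-level estimates."""
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching standard errors")

    # Inverse-variance (fixed-effect) weights and weighted mean.
    w = [1.0 / se ** 2 for se in std_errors]
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

    # Cochran's Q: dispersion of estimates relative to their sampling error.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # estimated between-study variance

    # Random-effects weights add tau2 to each study's sampling variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))

    z = 1.96  # normal approximation (an assumption of this sketch)
    ci = (pooled - z * se_pooled, pooled + z * se_pooled)
    # Interval for the effect in a NEW setting: adds tau2 to the uncertainty.
    spread = math.sqrt(tau2 + se_pooled ** 2)
    prediction = (pooled - z * spread, pooled + z * spread)
    return {"pooled": pooled, "tau2": tau2, "ci": ci, "prediction": prediction}


# HYPOTHETICAL estimates of the same intervention from five different sites.
site_effects = [0.05, 0.30, 0.12, -0.02, 0.25]
site_std_errors = [0.05] * 5

if __name__ == "__main__":
    summary = random_effects_summary(site_effects, site_std_errors)
    print("pooled effect:", round(summary["pooled"], 3))
    print("interval for the average effect:", [round(x, 3) for x in summary["ci"]])
    print("interval for a new site:", [round(x, 3) for x in summary["prediction"]])
```

With these made-up numbers, the interval for the average effect excludes zero while the interval for a new site does not: the intervention works on average, yet the data cannot rule out no effect in the next place it is tried.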

The scale-up problem

Why pilots don't always scale

A specific and crucial external-validity problem: an intervention that works as a small PILOT may not work at SCALE, for several reasons.

(1) General-equilibrium effects. A job-training programme that helps a few trainees get jobs may not help if EVERYONE is trained: they compete for the same limited jobs (the displacement spillover of module 4, now at scale). A small cash transfer may not move prices, but a large one might, raising local prices and eroding the benefit. The partial-equilibrium effect (a few treated) differs from the general-equilibrium effect (everyone treated).

(2) Implementation and delivery. A pilot run by a motivated NGO with careful oversight may work, while the same programme delivered by an overstretched government bureaucracy at national scale fails (the implementation gap; the state-capacity problem of the Governance course).

(3) Selection of context. Pilots are often run in favourable settings (a willing district, a capable partner); scaling to average or hard settings dilutes the effect.

(4) Market and behavioural responses that only appear at scale, once the programme is large enough for suppliers, employers, or non-participants to react to it.

So 'it worked in the RCT' does NOT mean 'it will work nationally'. The scale-up problem (Banerjee et al; the 'last mile' of evidence-based policy) is one of the central challenges of translating evidence into policy, and ignoring it by scaling a pilot naively is a common, costly mistake. Anticipating general-equilibrium and implementation effects is essential before scaling.
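The job-training displacement story can be sketched as a deliberately extreme toy model. It assumes a FIXED number of vacancies and that employers always hire trained applicants first; both assumptions, the function name, and all numbers are hypothetical and chosen only to isolate the partial-equilibrium versus general-equilibrium gap.

```python
def employment_rates(n_seekers, n_jobs, n_trained):
    """Employment rates when vacancies are fixed and trained applicants are hired first."""
    if n_seekers <= 0 or n_jobs < 0 or not 0 <= n_trained <= n_seekers:
        raise ValueError("inconsistent labour-market sizes")
    n_untrained = n_seekers - n_trained

    # Trained applicants fill vacancies first; the untrained get what is left.
    jobs_to_trained = min(n_jobs, n_trained)
    jobs_to_untrained = min(n_jobs - jobs_to_trained, n_untrained)

    return {
        "trained": jobs_to_trained / n_trained if n_trained else None,
        "untrained": jobs_to_untrained / n_untrained if n_untrained else None,
        "overall": (jobs_to_trained + jobs_to_untrained) / n_seekers,
    }


# HYPOTHETICAL labour market: 1,000 job seekers competing for 400 vacancies.
N_SEEKERS, N_JOBS = 1000, 400

pilot = employment_rates(N_SEEKERS, N_JOBS, n_trained=50)     # 5% trained
scaled = employment_rates(N_SEEKERS, N_JOBS, n_trained=1000)  # everyone trained

if __name__ == "__main__":
    gap = pilot["trained"] - pilot["untrained"]
    print("pilot: trained-vs-untrained employment gap =", round(gap, 3))
    print("pilot: overall employment rate =", pilot["overall"])
    print("scale: employment rate of trained =", scaled["trained"])
    print("scale: overall employment rate =", scaled["overall"])
```

In this toy market the pilot comparison shows a large employment gain for trainees, yet overall employment is identical in both scenarios because every job a trainee wins is a job someone else loses. Real labour markets sit somewhere between this pure-displacement case and no displacement at all, which is why the size of the general-equilibrium effect has to be assessed rather than assumed.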

Cost-effectiveness

Even an intervention that works (and generalises) may not be the BEST use of scarce funds: the policy question is comparative. Cost-effectiveness analysis ranks interventions by their cost per unit of outcome (cost per additional year of schooling, per case of disease averted, per life saved, per dollar of income gained). This is what lets a budget-constrained policymaker choose AMONG proven interventions, and the results can be startling. J-PAL and GiveWell-style cost-effectiveness comparisons found, for example, that deworming or providing information can deliver vastly more education per dollar than many more-expensive interventions, and that some popular programmes are far less cost-effective than alternatives. Cost-effectiveness analysis differs from cost-BENEFIT analysis (the CBA course), in which benefits are monetised; it is often used in social sectors where monetising outcomes is hard. The key insight: 'it works' is not enough; the policy-relevant question is 'does it deliver more per dollar than the alternatives?' The evidence base is therefore most useful when interventions are compared on cost-effectiveness, not just evaluated one at a time. This connects impact evaluation to the cost-benefit and MVPF tools of the Public Finance area: the goal is the best welfare per dollar, and rigorous effect estimates are an input to that.
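The ranking logic is simple arithmetic: divide cost by effect and sort. The sketch below illustrates it with three HYPOTHETICAL programmes; the names, costs, and effects are invented for illustration and are not estimates from J-PAL, GiveWell, or any study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    name: str
    cost_per_participant: float    # USD per participant
    effect_per_participant: float  # additional years of schooling per participant

    @property
    def cost_per_unit(self) -> float:
        """Cost per additional year of schooling."""
        if self.effect_per_participant <= 0:
            raise ValueError(f"{self.name}: cost per unit needs a positive effect")
        return self.cost_per_participant / self.effect_per_participant


def rank_by_cost_effectiveness(interventions):
    """Return interventions ordered from cheapest to dearest per unit of outcome."""
    return sorted(interventions, key=lambda i: i.cost_per_unit)


# HYPOTHETICAL programmes, all assumed to have credible effect estimates.
programmes = [
    Intervention("Programme A", cost_per_participant=5.0, effect_per_participant=0.10),
    Intervention("Programme B", cost_per_participant=200.0, effect_per_participant=0.50),
    Intervention("Programme C", cost_per_participant=40.0, effect_per_participant=0.20),
]

if __name__ == "__main__":
    for p in rank_by_cost_effectiveness(programmes):
        print(f"{p.name}: ${p.cost_per_unit:,.0f} per additional year of schooling")
```

Programme B has the largest effect per participant but ranks last, because each additional year of schooling costs several times more than under Programme A. A ranking like this is only as good as its inputs, so costs and effects should be re-estimated for the scale and context in which the programme would actually run.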

The RCT critique and responsible use

Deaton-Cartwright and using evidence well

The experimental approach has powerful critics, and engaging them is part of using it responsibly. Angus Deaton and Nancy Cartwright ('Understanding and Misunderstanding RCTs', 2018) argue: (1) internal validity doesn't guarantee policy relevance (an unbiased estimate of an effect HERE tells you little about THERE without a theory of WHY it worked); (2) RCTs estimate an AVERAGE effect that may hide important heterogeneity and may apply to no actual individual; (3) without understanding MECHANISMS (why the intervention works), you can't predict whether it will transfer, so RCTs need THEORY, not just experiments; and (4) the 'gold standard' framing can crowd out other valuable evidence (structural models, observational studies, qualitative understanding) and important non-experimentable questions (you can't randomise a currency regime or an institution).

The constructive response (which the field has largely absorbed): RCTs are a powerful tool, not the only one; combine experimental effect estimates with THEORY and MECHANISM (why does it work?), with cost-effectiveness comparison, and with judgement about external validity and scale-up; use the evidence base (J-PAL, and what's been settled: CCTs, deworming, the limited impact of microcredit) as INPUTS to policy reasoning, not as automatic prescriptions; and retain humility about generalisation.

The mature position is neither RCT-worship nor RCT-rejection but disciplined eclecticism: rigorous causal evidence, interpreted through theory and mechanism, compared on cost-effectiveness, and applied with explicit attention to external validity, scale, and context. That disciplined, humble use of evidence is the real lesson of the course, and the bridge from this methods specialization back to the substantive policy questions of the whole program.
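Points (1) to (3) of the critique can be illustrated with a few lines of arithmetic. The sketch below assumes, hypothetically, that an intervention only helps credit-constrained households and that this subgroup effect is the same everywhere; the subgroup labels, shares, and effect sizes are all invented. The assumption that subgroup effects transfer is exactly what a theory of the mechanism would have to justify.

```python
def average_effect(subgroup_effects, population_shares):
    """Population-average effect as a share-weighted mean of subgroup effects."""
    if set(subgroup_effects) != set(population_shares):
        raise ValueError("effects and shares must cover the same subgroups")
    if abs(sum(population_shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(population_shares[g] * subgroup_effects[g] for g in subgroup_effects)


# HYPOTHETICAL mechanism: the programme only helps credit-constrained households.
subgroup_effects = {"credit_constrained": 1.0, "unconstrained": 0.0}

# HYPOTHETICAL population mixes at the study site and at the target site.
study_site = {"credit_constrained": 0.3, "unconstrained": 0.7}
target_site = {"credit_constrained": 0.1, "unconstrained": 0.9}

if __name__ == "__main__":
    print("average effect at study site:", average_effect(subgroup_effects, study_site))
    print("average effect at target site:", average_effect(subgroup_effects, target_site))
```

The study-site average of 0.3 describes no actual household, since each one experiences either 1.0 or 0.0, and the same programme delivers a much smaller average where fewer households are constrained. Without knowing who responds and why, there is no principled way to reweight the estimate for a new population.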

Exercise

An RCT in one district finds that a new agricultural-extension programme (training farmers in better techniques) raised participating farmers' yields by 30%, cost-effectively. The agriculture minister wants to scale it nationally immediately. (1) Explain why the internally-valid 30% estimate is not sufficient to justify national scale-up. (2) Identify the general-equilibrium and implementation risks of scaling. (3) Explain how cost-effectiveness should inform the decision. (4) Apply the Deaton-Cartwright critique to advise the minister on using this evidence responsibly.

Key takeaways

  • Internal validity (correct effect for the studied context — the RCT's strength) is necessary but NOT sufficient for policy; external validity (does it hold elsewhere, at scale?) is a separate, harder question, and effect sizes vary substantially across contexts (Vivalt)
  • The scale-up problem: pilots may not work at scale because of general-equilibrium effects (everyone treated competes/moves prices), implementation gaps (government vs pilot implementer), and favourable-context selection
  • 'It works' is not enough — cost-effectiveness (cost per unit of outcome) is what lets a budget-constrained policymaker compare among proven interventions and choose the best buy; re-assess it at realistic scale
  • The Deaton-Cartwright critique: internal validity ≠ policy relevance without a theory of WHY it works (mechanism); RCTs need theory, hide heterogeneity, and can't address non-experimentable questions — the gold-standard framing can crowd out other evidence
  • The mature position is disciplined eclecticism: rigorous causal evidence interpreted through theory and mechanism, compared on cost-effectiveness, applied with explicit attention to external validity, scale, and context — the bridge from methods back to policy

Further reading

  1. Understanding and Misunderstanding Randomized Controlled Trials

    Angus Deaton & Nancy Cartwright · Social Science & Medicine 210 · 2018. The most influential critique of RCTs: internal vs external validity, the need for mechanisms and theory. Essential for using evidence responsibly.

  2. How Much Can We Generalize from Impact Evaluations?

    Eva Vivalt · Journal of the European Economic Association 18(6) · 2020. The evidence that effect sizes vary substantially across contexts. The external-validity problem, quantified.

  3. From Proof of Concept to Scalable Policies: Challenges and Solutions

    Abhijit Banerjee, Rukmini Banerji, James Berry et al. · Journal of Economic Perspectives 31(4) · 2017. The scale-up problem and how to address it. The bridge from RCT evidence to policy at scale.

LeadAfrik Public Economics Hub