When you can't randomise — difference-in-differences — Impact Evaluation Module 5

Most policy is not randomised — it is rolled out: a law passes in some states, a programme launches in some districts, a reform happens at a date. When you can't randomise, you turn to quasi-experimental methods that exploit this natural variation. The first and most widely used is difference-in-differences, which this module covers — including the recent realisation that the standard way of estimating it can be badly wrong.

The difference-in-differences idea

Differencing out trends and fixed differences

Difference-in-differences (DiD) compares the CHANGE in outcomes over time for a group that got the treatment to the CHANGE for a group that didn't. By taking the difference of differences, it removes two confounds at once:

• Differencing over time (before vs after) within each group removes any FIXED differences between the groups (if the treated region was always richer, that constant difference cancels).
• Differencing across groups (treated vs control) removes any common TIME TRENDS (if both regions grew because the economy grew, that common trend cancels).

What's left, the difference in the changes, is attributed to the treatment. Concretely:

DiD = (treated_after − treated_before) − (control_after − control_before)

The classic example: Card and Krueger (1994) studied a minimum-wage increase in New Jersey, using neighbouring Pennsylvania (no increase) as the control and comparing the change in fast-food employment in each. They found (controversially) no employment loss. DiD is the workhorse of policy evaluation precisely because policies so often roll out in some places/times and not others, providing the treated-and-control, before-and-after structure it needs.
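The DiD formula can be sketched in a few lines of Python. This is a minimal illustration only: the function name `did_estimate` and the four group means are hypothetical, chosen to make the arithmetic easy to follow, and are not taken from any study.

```python
def did_estimate(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences from four group means."""
    treated_change = treated_after - treated_before   # removes the treated group's fixed level
    control_change = control_after - control_before   # captures the common time trend
    return treated_change - control_change            # what remains is attributed to treatment


# Hypothetical group means (made-up numbers for illustration).
estimate = did_estimate(
    treated_before=20.0, treated_after=26.0,
    control_before=15.0, control_after=18.0,
)
print(estimate)  # 3.0
```

Here the treated group rose by 6 and the control group by 3, so the DiD estimate is 3. The treated group's higher starting level (20 vs 15) plays no role in the result.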

The parallel-trends assumption

The key identifying assumption

DiD's validity rests entirely on the PARALLEL-TRENDS assumption: ABSENT the treatment, the treated and control groups would have followed PARALLEL paths (the same trend). The treated group's counterfactual change is assumed to equal the control group's actual change. If this holds, the control's change is a valid counterfactual for the treated's change, and DiD gives the causal effect. If it FAILS — if the treated group was already on a different trajectory (e.g., the region that adopted the policy was already growing faster for other reasons) — then DiD attributes that pre-existing differential trend to the treatment, biasing the estimate. Parallel trends is fundamentally UNTESTABLE (it's about the unobserved counterfactual), BUT it can be made more or less credible by checking PRE-TRENDS: if the groups moved in parallel BEFORE the treatment (in the periods leading up to it), that supports (doesn't prove) the assumption that they'd have continued in parallel. Divergent pre-trends are a red flag that parallel trends likely fails. Assessing parallel trends (via pre-trends) is the central task in judging any DiD study — a DiD with divergent pre-trends is not credible.
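A rough way to see what a pre-trend check looks at: if the two groups moved in parallel before treatment, the gap between them stays constant from one pre-treatment period to the next. The sketch below assumes we already have group means for each pre-treatment period. The function names and all numbers are hypothetical, and a real check would also weigh the changes against sampling noise.

```python
def pre_period_gaps(treated, control):
    """Treated-minus-control gap in each pre-treatment period."""
    return [t - c for t, c in zip(treated, control)]


def gap_changes(gaps):
    """Period-to-period change in the gap; all zeros means parallel pre-trends."""
    return [later - earlier for earlier, later in zip(gaps, gaps[1:])]


# Hypothetical pre-treatment group means over three periods.
control = [7.0, 9.0, 11.0]
parallel_treated = [10.0, 12.0, 14.0]    # same slope as control
divergent_treated = [10.0, 13.0, 16.0]   # already growing faster than control

print(gap_changes(pre_period_gaps(parallel_treated, control)))   # [0.0, 0.0]
print(gap_changes(pre_period_gaps(divergent_treated, control)))  # [1.0, 1.0]
```

The second case is the red flag described above: the gap was already widening before treatment, so a DiD estimate would pick up that pre-existing differential trend.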

Two-way fixed effects and event studies

DiD is typically estimated with a two-way fixed effects (TWFE) regression: regress the outcome on unit fixed effects (absorbing fixed differences between units), time fixed effects (absorbing common time shocks), and a treatment indicator equal to 1 for treated units in post-treatment periods (the coefficient on which is the DiD estimate). With two groups and two periods this exactly reproduces the simple DiD. The event-study specification extends this: instead of a single before/after contrast, estimate the treatment effect in each period RELATIVE to the treatment date, measured against an omitted reference period (conventionally the period just before treatment), producing a plot of effects over event time. The event-study plot is the workhorse diagnostic: the PRE-treatment coefficients should be near zero (no pre-trend, which supports parallel trends) and the POST-treatment coefficients trace out the DYNAMIC effect (does it grow, fade, persist?). A good DiD study always shows the event-study plot, because it simultaneously probes the identifying assumption (flat pre-trends support it but cannot prove it) and reveals the effect's dynamics.
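The claim that TWFE reproduces the simple DiD in the two-group, two-period case can be checked by hand. The sketch below assumes a balanced panel, where the TWFE coefficient can be obtained by "double demeaning" the outcome and the treatment indicator (subtracting unit means and period means, adding back the grand mean) and then running a one-variable regression. The helper names and the numbers are hypothetical, and real work would use a regression package rather than this hand-rolled version.

```python
def double_demean(panel):
    """panel[i][t] minus unit mean minus period mean plus grand mean (balanced panel)."""
    n_units, n_periods = len(panel), len(panel[0])
    unit_means = [sum(row) / n_periods for row in panel]
    period_means = [sum(row[t] for row in panel) / n_units for t in range(n_periods)]
    grand_mean = sum(unit_means) / n_units
    return [[panel[i][t] - unit_means[i] - period_means[t] + grand_mean
             for t in range(n_periods)] for i in range(n_units)]


def twfe_coefficient(y, d):
    """Coefficient on d in a regression of y on d with unit and period fixed effects."""
    y_dm, d_dm = double_demean(y), double_demean(d)
    numerator = sum(dv * yv for d_row, y_row in zip(d_dm, y_dm)
                    for dv, yv in zip(d_row, y_row))
    denominator = sum(dv * dv for d_row in d_dm for dv in d_row)
    return numerator / denominator


# Hypothetical 2x2 panel: rows are units (treated, control), columns are (before, after).
y = [[20.0, 26.0],
     [15.0, 18.0]]
d = [[0.0, 1.0],   # treated unit is treated only in the "after" period
     [0.0, 0.0]]   # control unit is never treated

simple_did = (y[0][1] - y[0][0]) - (y[1][1] - y[1][0])
print(simple_did)              # 3.0
print(twfe_coefficient(y, d))  # 3.0
```

The two numbers agree, as the text states. The equivalence is specific to this simple design and does not carry over automatically to panels with many periods and varying treatment dates.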

The staggered-adoption problem

When TWFE goes wrong

A major recent development (the 'DiD revolution' of the late 2010s) revealed that the standard TWFE regression can be SEVERELY BIASED when treatment is adopted at DIFFERENT TIMES by different units (staggered adoption — the common real-world case, e.g., states adopting a policy in different years). The problem (Goodman-Bacon, 2021): with staggered timing and effects that change over time, TWFE implicitly uses ALREADY-TREATED units as controls for LATER-treated units — a 'forbidden comparison' that can put NEGATIVE WEIGHTS on some treatment effects, so the TWFE estimate can be wrongly signed even when every unit's true effect is positive. This is not a minor technicality — many published DiD studies using TWFE with staggered adoption may be biased. The new estimators (Callaway-Sant'Anna, de Chaisemartin-D'Haultfœuille, Sun-Abraham, and others) fix this by avoiding the forbidden comparisons (using only clean never-treated or not-yet-treated controls and aggregating effects properly). The practical lesson: for staggered-adoption DiD, do NOT rely on naive TWFE — use the modern robust estimators, and be sceptical of older studies that used TWFE on staggered designs. This is one of the most important methodological developments of recent years and a live area where the credible practice has changed.
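A tiny noise-free example makes the forbidden comparison concrete. The sketch below assumes two units, three periods, and an effect that grows with time since adoption. All names and numbers are hypothetical. The clean comparison here is a hand-computed 2x2 against a not-yet-treated control, in the spirit of the robust estimators, not an implementation of any of them.

```python
def double_demean(panel):
    """panel[i][t] minus unit mean minus period mean plus grand mean (balanced panel)."""
    n_units, n_periods = len(panel), len(panel[0])
    unit_means = [sum(row) / n_periods for row in panel]
    period_means = [sum(row[t] for row in panel) / n_units for t in range(n_periods)]
    grand_mean = sum(unit_means) / n_units
    return [[panel[i][t] - unit_means[i] - period_means[t] + grand_mean
             for t in range(n_periods)] for i in range(n_units)]


def twfe_coefficient(y, d):
    """Coefficient on d in a regression of y on d with unit and period fixed effects."""
    y_dm, d_dm = double_demean(y), double_demean(d)
    numerator = sum(dv * yv for d_row, y_row in zip(d_dm, y_dm)
                    for dv, yv in zip(d_row, y_row))
    denominator = sum(dv * dv for d_row in d_dm for dv in d_row)
    return numerator / denominator


# Hypothetical setup: every true effect is positive and grows with exposure.
effect_by_exposure = [1.0, 5.0]          # effect in adoption period, then one period later
unit_fe = {"early": 10.0, "late": 4.0}   # fixed differences between units
time_fe = [0.0, 2.0, 3.0]                # common time shocks
adoption = {"early": 1, "late": 2}       # first treated period for each unit


def outcome(unit, t):
    y = unit_fe[unit] + time_fe[t]
    if t >= adoption[unit]:
        y += effect_by_exposure[t - adoption[unit]]
    return y


units = ["early", "late"]
y = [[outcome(u, t) for t in range(3)] for u in units]
d = [[1.0 if t >= adoption[u] else 0.0 for t in range(3)] for u in units]
early, late = y

twfe = twfe_coefficient(y, d)
# Clean 2x2: early adopter vs the not-yet-treated late unit, periods 0 -> 1.
clean = (early[1] - early[0]) - (late[1] - late[0])
# Forbidden 2x2: late adopter vs the already-treated early unit, periods 1 -> 2.
forbidden = (late[2] - late[1]) - (early[2] - early[1])

print(round(twfe, 6))  # -1.0
print(clean)           # 1.0
print(forbidden)       # -3.0
```

Every true effect in this setup is positive (1, 5 and 1), yet the TWFE coefficient is negative. The forbidden comparison treats the early unit's growing effect (from 1 to 5) as if it were a common trend and subtracts it from the late unit's change. TWFE here averages the clean and forbidden comparisons, which is where the wrong sign comes from.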

Exercise

A researcher evaluates a health insurance programme that was rolled out to different districts in different years, using a two-way fixed effects difference-in-differences with all other districts as controls. (1) Explain the DiD logic and what it's trying to difference out. (2) Explain the parallel-trends assumption and how the researcher should assess it. (3) Explain why the staggered rollout makes the naive TWFE estimate potentially unreliable. (4) Recommend how the researcher should estimate the effect credibly.