Sometimes you have only observational data and no threshold or natural experiment — just a treated group and an untreated group that differ. This module covers the two main strategies for these hardest cases: matching (which assumes you can adjust away the differences you can see) and instrumental variables (which can handle the differences you can't see, IF you can find a valid instrument). It ends by placing all the course's methods in a credibility hierarchy.
Matching and selection on observables
Adjusting for what you can see
Matching tackles selection bias by comparing treated and untreated units that are SIMILAR on OBSERVABLE characteristics. The idea: if a treated and an untreated unit have the same age, education, prior earnings, location, etc., then (ASSUMING those observables capture all the relevant differences) the untreated unit is a valid counterfactual for the treated one. Propensity-score matching simplifies this: instead of matching on many characteristics at once, estimate each unit's PROBABILITY of being treated (the propensity score) given its observables, and match treated to untreated units with similar propensity scores (Rosenbaum-Rubin). The identifying assumption is CONDITIONAL INDEPENDENCE (also called selection on observables or unconfoundedness): conditional on the observed characteristics, treatment is as-good-as-random, i.e., there are NO UNOBSERVED differences between treated and untreated units (given the observables) that affect the outcome. Matching also needs OVERLAP (common support): every treated unit must have comparable untreated units to be matched with. If both conditions hold, matching removes selection bias. Matching is intuitive and widely used, but it lives or dies by the conditional-independence assumption.
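The logic above can be sketched in a small simulation. This is a minimal illustration, not a production estimator: the data-generating process, the single discrete covariate `x`, the true effect of 2.0, and all function names are assumptions made up for the example. Because `x` is discrete, the estimated propensity score is simply the share of treated units in each cell, and matching on it amounts to comparing treated and untreated units within the same cell.

```python
import random
from collections import defaultdict

random.seed(0)
TRUE_EFFECT = 2.0  # assumed treatment effect in this made-up simulation


def avg(values):
    return sum(values) / len(values)


def simulate(n=20000):
    """Hypothetical data where selection depends ONLY on the observed x."""
    rows = []
    for _ in range(n):
        x = random.randrange(4)  # observed covariate, e.g. education level 0-3
        d = 1 if random.random() < 0.2 + 0.2 * x else 0  # high x: more likely treated
        y = TRUE_EFFECT * d + 1.5 * x + random.gauss(0, 1)  # x also raises the outcome
        rows.append((x, d, y))
    return rows


def naive_difference(rows):
    """Raw treated-minus-untreated gap, ignoring x."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return avg(treated) - avg(control)


def propensity_scores(rows):
    """Estimated P(treated | x): the treated share within each x cell."""
    cells = defaultdict(list)
    for x, d, _ in rows:
        cells[x].append(d)
    return {x: avg(ds) for x, ds in cells.items()}


def matched_att(rows, scores):
    """Compare units with the same propensity score; weight cells by treated count."""
    groups = defaultdict(lambda: {0: [], 1: []})
    for x, d, y in rows:
        groups[scores[x]][d].append(y)
    total, n_treated = 0.0, 0
    for g in groups.values():
        if g[0] and g[1]:  # overlap: only cells with both treated and untreated units
            total += len(g[1]) * (avg(g[1]) - avg(g[0]))
            n_treated += len(g[1])
    return total / n_treated


rows = simulate()
scores = propensity_scores(rows)
naive = naive_difference(rows)
att = matched_att(rows, scores)
print(f"naive difference: {naive:.2f}")
print(f"matched estimate: {att:.2f} (true effect {TRUE_EFFECT})")
```

In this setup the naive gap overstates the effect because treated units have higher `x`, while the matched estimate should land close to the assumed true effect. That only works because the simulation builds in conditional independence by construction; real data offer no such guarantee.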
The fatal weakness
Selection on unobservables
Matching's fatal weakness is that it can only adjust for what you OBSERVE — and the conditional-independence assumption (no unobserved confounders) is UNTESTABLE and usually IMPLAUSIBLE. The whole problem of selection bias (module 1) is typically driven by UNOBSERVABLES — the motivation, ability, drive, or private information that leads units to select into treatment AND affects their outcomes, and that you cannot measure. Matching on observed characteristics does nothing about these: two people with identical observed age, education, and earnings can still differ in unmeasured entrepreneurial drive, and if that drove their treatment choice, matching leaves the bias intact. So matching is only as good as the claim that you've observed and adjusted for EVERY relevant confounder — a claim that is rarely credible, because the confounders that cause selection are usually exactly the hard-to-measure ones. This is why matching sits LOW in the credibility hierarchy: it's better than a raw comparison (it removes observable differences), but it cannot solve selection on unobservables, which is the heart of the problem. Treat matching estimates with caution and never mistake 'we controlled for observables' for 'we eliminated selection bias'.
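A small simulation makes the failure concrete. This is a sketch under invented assumptions: the binary variable `u` stands in for unmeasured drive, the true effect is set to 2.0, and the function names are hypothetical. Matching on the observed `x` alone is compared with an "oracle" match on both `x` and `u`, which is only possible here because the simulation generates `u`.

```python
import random
from collections import defaultdict

random.seed(1)
TRUE_EFFECT = 2.0  # assumed treatment effect in this made-up simulation


def avg(values):
    return sum(values) / len(values)


def simulate(n=20000):
    """Hypothetical data where an UNOBSERVED trait u drives both selection and outcome."""
    rows = []
    for _ in range(n):
        x = random.randrange(4)  # observed covariate
        u = random.randrange(2)  # unobserved drive (0 or 1); a real analyst cannot see it
        p_treat = 0.1 + 0.15 * x + 0.3 * u  # driven units select into treatment more often
        d = 1 if random.random() < p_treat else 0
        y = TRUE_EFFECT * d + 1.5 * x + 2.0 * u + random.gauss(0, 1)
        rows.append({"x": x, "u": u, "d": d, "y": y})
    return rows


def matched_att(rows, keys):
    """Exact matching on the named covariates; cells weighted by treated count."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for r in rows:
        cells[tuple(r[k] for k in keys)][r["d"]].append(r["y"])
    total, n_treated = 0.0, 0
    for c in cells.values():
        if c[0] and c[1]:  # use only cells containing both groups
            total += len(c[1]) * (avg(c[1]) - avg(c[0]))
            n_treated += len(c[1])
    return total / n_treated


rows = simulate()
att_observed = matched_att(rows, ["x"])  # what the analyst can actually do
att_oracle = matched_att(rows, ["x", "u"])  # infeasible: requires observing u
print(f"matched on x only:  {att_observed:.2f}")
print(f"matched on x and u: {att_oracle:.2f} (true effect {TRUE_EFFECT})")
```

Within every `x` cell the treated units are disproportionately high-`u`, so the estimate that matches on `x` alone stays biased upward however large the sample. Only the infeasible oracle match recovers the assumed effect.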
Instrumental variables
Finding as-good-as-random variation (IV)
Instrumental variables (IV) is the strategy for selection on UNOBSERVABLES. The idea: find an INSTRUMENT, a variable that affects WHETHER a unit gets treated but affects the OUTCOME ONLY THROUGH treatment (not directly or through anything else). A valid instrument isolates a slice of variation in treatment that is AS-GOOD-AS-RANDOM (not driven by the units' own confounded choices), and uses only that variation to estimate the effect. The two requirements:
• Relevance: the instrument must actually affect treatment (a strong first stage; weak instruments give unreliable estimates). Unlike the exclusion restriction, relevance can be checked in the data.
• Exclusion restriction (exogeneity): the instrument affects the outcome ONLY through treatment, and is unrelated to the unobserved confounders. This is the crucial, UNTESTABLE assumption, defended by argument, not data.
Examples: using DISTANCE to the nearest college as an instrument for college attendance (distance affects attendance but, arguably, not earnings except through education; Card); QUARTER OF BIRTH as an instrument for years of schooling (compulsory-schooling laws make birth quarter affect schooling; Angrist-Krueger); RAINFALL as an instrument for income (rain affects farm income but, arguably, does not affect the outcome of interest directly).
A valid IV recovers the LATE (local average treatment effect): the effect for COMPLIERS, those whose treatment is shifted by the instrument. When effects differ across units, this interpretation also requires that the instrument never pushes anyone AWAY from treatment (monotonicity). The catch: GOOD instruments are rare and the exclusion restriction is hard to defend (any direct effect of the instrument on the outcome invalidates it), so IV studies live or die on the credibility of the exclusion restriction, which must be argued carefully and is often contestable.
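With a binary instrument, the IV estimate reduces to the Wald ratio: the instrument's effect on the outcome (the reduced form) divided by its effect on treatment (the first stage). The simulation below is a minimal sketch under stated assumptions: the instrument `z`, the confounder `u`, the true effect of 2.0, and the function names are all invented, and the exclusion restriction holds only because `z` is deliberately left out of the outcome equation. The effect is the same for every unit, so the LATE coincides with the assumed true effect.

```python
import random

random.seed(2)
TRUE_EFFECT = 2.0  # assumed constant treatment effect in this made-up simulation


def avg(values):
    return sum(values) / len(values)


def simulate(n=50000):
    """Hypothetical data: z shifts treatment, u confounds treatment and outcome."""
    rows = []
    for _ in range(n):
        z = random.randrange(2)  # instrument, e.g. 1 = lives near a college
        u = random.gauss(0, 1)  # unobserved confounder, independent of z
        d = 1 if 1.5 * z + u + random.gauss(0, 1) > 0.75 else 0  # relevance: z raises P(d=1)
        y = TRUE_EFFECT * d + 1.0 * u + random.gauss(0, 0.5)  # exclusion: z is absent here
        rows.append((z, d, y))
    return rows


def naive_difference(rows):
    """Raw treated-minus-untreated gap; contaminated by u."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return avg(treated) - avg(control)


def wald_estimate(rows):
    """Reduced form divided by first stage, for a binary instrument."""
    d_z1 = avg([d for z, d, _ in rows if z == 1])
    d_z0 = avg([d for z, d, _ in rows if z == 0])
    y_z1 = avg([y for z, _, y in rows if z == 1])
    y_z0 = avg([y for z, _, y in rows if z == 0])
    first_stage = d_z1 - d_z0  # effect of z on treatment take-up
    reduced_form = y_z1 - y_z0  # effect of z on the outcome
    return reduced_form / first_stage, first_stage


rows = simulate()
naive = naive_difference(rows)
iv, first_stage = wald_estimate(rows)
print(f"naive difference: {naive:.2f}")
print(f"first stage:      {first_stage:.2f}")
print(f"IV (Wald):        {iv:.2f} (true effect {TRUE_EFFECT})")
```

The first stage is the one piece that can be inspected in real data. If it is close to zero, the ratio divides by a small noisy number and the estimate becomes unreliable, which is the weak-instrument problem. Adding even a small direct term in `z` to the outcome equation would break the exclusion restriction and bias the ratio, and nothing in the data would reveal it.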
The credibility hierarchy
Ranking the methods
The methods of this course form a rough CREDIBILITY HIERARCHY, by how plausibly they eliminate selection bias:
1. RCT: randomisation balances observables AND unobservables; the gold standard (when feasible and ethical).
2. Regression discontinuity & well-designed difference-in-differences: credible natural experiments (local randomisation at a cutoff; parallel trends with good pre-trends), strong when their assumptions hold.
3. Instrumental variables: can handle unobservables, but only as credible as the (untestable, often contestable) exclusion restriction.
4. Matching / propensity scores / regression controls: only handle selection on OBSERVABLES; cannot address unobserved confounders.
This ranking reflects how much each method asks you to assume: the RCT asks almost nothing (chance did the work); matching asks you to believe you've measured every confounder (usually implausible). The practical lesson: prefer designs higher in the hierarchy where possible; when forced lower, be explicit about the assumptions, defend them carefully, probe robustness, and calibrate your confidence accordingly. A result from a credible RCT or RD deserves more weight than one from matching, and honest empirical work makes its identifying assumption, and its credibility, explicit. This hierarchy is the practical summary of the whole methods sequence.
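The claim that randomisation balances unobservables can be illustrated with one last sketch. The setup is hypothetical: `u` is an unmeasured trait, the true effect is fixed at 2.0, and the two assignment rules are invented for the comparison. The same simple difference in means is computed twice, once when a coin flip assigns treatment and once when units select into treatment based on `u`.

```python
import random

random.seed(3)
TRUE_EFFECT = 2.0  # assumed treatment effect in this made-up simulation


def avg(values):
    return sum(values) / len(values)


def randomised(u):
    """RCT: a coin flip decides treatment and ignores u entirely."""
    return random.randrange(2)


def self_selected(u):
    """Observational data: units with high u are more likely to opt in."""
    return 1 if u + random.gauss(0, 1) > 0 else 0


def difference_in_means(assign, n=20000):
    """Treated-minus-untreated gap under a given assignment rule."""
    treated, control = [], []
    for _ in range(n):
        u = random.gauss(0, 1)  # unobserved trait that also raises the outcome
        d = assign(u)
        y = TRUE_EFFECT * d + 2.0 * u + random.gauss(0, 1)
        (treated if d else control).append(y)
    return avg(treated) - avg(control)


rct = difference_in_means(randomised)
observational = difference_in_means(self_selected)
print(f"randomised assignment: {rct:.2f} (true effect {TRUE_EFFECT})")
print(f"self-selection:        {observational:.2f}")
```

Under the coin flip, `u` is spread evenly across the two groups, so no adjustment is needed and no confounder has to be measured. Under self-selection the same calculation absorbs the gap in `u` and overstates the effect.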
Exercise
A researcher wants to estimate the effect of joining a farmer cooperative on farm income, using observational survey data (cooperative members vs non-members). (1) Explain how propensity-score matching would approach this and its key assumption. (2) Explain why matching is likely to fail here, citing the specific unobservable. (3) Propose an instrumental-variables strategy and state what the instrument must satisfy. (4) Place the available approaches in the credibility hierarchy and advise on the best feasible design.