Reading empirical economics critically — Econometrics Module 12

Most published empirical economics is wrong, in the narrow sense that the headline coefficient is biased, the standard errors are too small, or the result wouldn't replicate. This is not a slur on the profession — it's an unavoidable feature of doing inference in a complicated world with finite data and publication incentives. Reading empirical work critically is its own skill, and the most valuable one a course like this can teach.

The five questions every empirical paper must answer

1. What is the source of variation? Which variation in x is the paper using to identify the effect?
2. Is that variation plausibly exogenous? Is it uncorrelated with the error term, that is, with every unobserved determinant of y?
3. What population does the estimate apply to? Average treatment effect, effect on the treated, LATE, marginal individuals?
4. How precise is the estimate, and is the precision honestly reported? Clustered SEs, multiple-comparisons correction, sample-size justification?
5. Does the result replicate across reasonable alternative specifications? Robustness, falsification, placebo tests?
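
The estimands named in the third question can be written in potential-outcomes notation. The symbols below (Y_i(1) and Y_i(0) for potential outcomes, D_i for treatment, Z_i for a binary instrument) are notation assumed here for illustration, not taken from the module. The ratio form of the LATE holds only under instrument independence, exclusion, a non-zero first stage and monotonicity (no defiers).

```latex
\begin{align*}
\text{ATE}  &= \mathbb{E}\,[\,Y_i(1) - Y_i(0)\,] \\
\text{ATT}  &= \mathbb{E}\,[\,Y_i(1) - Y_i(0) \mid D_i = 1\,] \\
\text{LATE} &= \mathbb{E}\,[\,Y_i(1) - Y_i(0) \mid D_i(1) > D_i(0)\,]
             = \frac{\mathbb{E}[Y_i \mid Z_i = 1] - \mathbb{E}[Y_i \mid Z_i = 0]}
                    {\mathbb{E}[D_i \mid Z_i = 1] - \mathbb{E}[D_i \mid Z_i = 0]}
\end{align*}
```

The three coincide only when treatment effects are homogeneous, which is why a paper should say which one it estimates.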

If a paper doesn't address question 1 directly

Walk away. A paper that just says 'we run regressions and report results' is selling a number it cannot defend causally. The whole apparatus of econometrics exists to produce defensible answers to question 1.

Identification strategies, in rough descending order of credibility

Randomised controlled trial (RCT): random assignment of treatment. Cleanest, but expensive and sometimes unethical.
Natural experiment / regression discontinuity: a sharp threshold or arbitrary rule generates quasi-random variation in x.
Difference-in-differences: a policy change applies to some units and not others, over time. Requires parallel trends.
Instrumental variables: a variable (the instrument) shifts x but affects y only through x. Requires a relevant first stage and a defensible exclusion restriction.
Panel fixed effects: removes time-invariant unobservables. Doesn't fix time-varying endogeneity.
Selection-on-observables / matching: condition on enough observed covariates that what's left is as-good-as-random. Credible only when the conditional-independence assumption is plausible.
Cross-sectional OLS with controls: kitchen-sink regression. Lowest credibility for causal claims; can be useful for description.
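
The gap between the top and the bottom of this ranking can be illustrated with a small simulation. This is a sketch under invented assumptions: the data-generating process, the variable name `ability`, and every parameter value are hypothetical, chosen only so that an unobserved confounder drives both treatment take-up and the outcome.

```python
import random

def ols_slope(x, y):
    """Slope from an OLS regression of y on x with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def estimate_effect(randomised, n=20000, true_effect=1.0, seed=0):
    """Simulate one dataset and return the estimated treatment effect."""
    rng = random.Random(seed)
    treatment, outcome = [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)  # unobserved confounder (hypothetical)
        if randomised:
            # Coin-flip assignment: treatment is independent of ability
            d = 1.0 if rng.random() < 0.5 else 0.0
        else:
            # Self-selection: high-ability units are more likely to be treated
            d = 1.0 if ability + rng.gauss(0, 1) > 0 else 0.0
        treatment.append(d)
        outcome.append(true_effect * d + 2.0 * ability + rng.gauss(0, 1))
    return ols_slope(treatment, outcome)

rct_estimate = estimate_effect(randomised=True)
ols_estimate = estimate_effect(randomised=False)
print(f"randomised: {rct_estimate:.2f}  self-selected: {ols_estimate:.2f}  truth: 1.00")
```

The same regression is run in both cases. Only the assignment mechanism differs, which is the sense in which the design, not the estimator, does the identification work.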

Specific bullshit detectors

Implausibly precise estimates

When a small effect comes with extremely tight standard errors (β = 0.05, SE = 0.005, t = 10), be suspicious. Often the SEs are wrong — clustering omitted, autocorrelation ignored, weak-instrument bias understated. Ask: would these SEs survive a more conservative cluster level?
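
A minimal sketch of why the cluster level matters, assuming a single regressor that varies only across clusters and an error with a shared within-cluster shock. The simulated data, the function name `slope_with_ses`, and the choice of 50 clusters of 40 observations are all illustrative assumptions. The cluster-robust formula is the plain sandwich estimator without a small-sample correction.

```python
import math
import random

def slope_with_ses(x, y, cluster):
    """OLS slope of y on x (with intercept), with iid and cluster-robust SEs."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    xd = [xi - mean_x for xi in x]                 # demeaned regressor
    sxx = sum(v * v for v in xd)
    beta = sum(v * (yi - mean_y) for v, yi in zip(xd, y)) / sxx
    resid = [(yi - mean_y) - beta * v for v, yi in zip(xd, y)]

    # Conventional SE: assumes independent, homoskedastic errors
    se_iid = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)

    # Cluster-robust SE: sum x*e within each cluster, then square the totals
    totals = {}
    for v, e, g in zip(xd, resid, cluster):
        totals[g] = totals.get(g, 0.0) + v * e
    se_cluster = math.sqrt(sum(s * s for s in totals.values())) / sxx
    return beta, se_iid, se_cluster

rng = random.Random(1)
x, y, cluster = [], [], []
for g in range(50):
    x_g = rng.gauss(0, 1)      # regressor is constant within a cluster
    shock_g = rng.gauss(0, 1)  # shock shared by every unit in the cluster
    for _ in range(40):
        x.append(x_g)
        y.append(shock_g + rng.gauss(0, 1))  # true slope is zero
        cluster.append(g)

beta, se_iid, se_cluster = slope_with_ses(x, y, cluster)
print(f"beta={beta:.3f}  iid SE={se_iid:.3f}  clustered SE={se_cluster:.3f}")
```

With 2,000 observations but only 50 independent clusters, the conventional SE treats the data as far more informative than they are. That is the pattern to look for behind a suspiciously large t-statistic.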

Disappearing controls

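If the headline coefficient drops substantially when an additional plausible control is added, the original coefficient was probably absorbing part of that variable's effect, and other omitted variables may be doing the same. Oster (2019) bounds combine the movement in the coefficient with the change in R-squared to estimate how much further the coefficient would move if selection on unobservables were as strong as selection on observables.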
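
The calculation can be sketched with the simple approximation to the bias-adjusted coefficient given in Oster (2019). The function name, argument names and input numbers below are hypothetical. The choice of maximum R-squared as 1.3 times the controlled R-squared is a commonly used heuristic, treated here as an assumption. The full estimator solves a more involved equation, so this is a rough guide only.

```python
def oster_beta_star(beta_short, r2_short, beta_long, r2_long, r2_max, delta=1.0):
    """Approximate bias-adjusted coefficient.

    beta_short, r2_short: coefficient and R-squared without the controls
    beta_long,  r2_long:  coefficient and R-squared with the controls
    r2_max: assumed R-squared if all unobservables were also included
    delta:  assumed ratio of selection on unobservables to selection on observables
    """
    if not r2_short < r2_long <= r2_max:
        raise ValueError("need r2_short < r2_long <= r2_max")
    movement = beta_short - beta_long
    return beta_long - delta * movement * (r2_max - r2_long) / (r2_long - r2_short)

# Hypothetical paper: coefficient falls from 0.10 to 0.08 as R-squared rises from 0.05 to 0.20
r2_max = min(1.0, 1.3 * 0.20)
beta_star = oster_beta_star(0.10, 0.05, 0.08, 0.20, r2_max)
print(f"bias-adjusted coefficient: {beta_star:.3f}")
```

If the interval between the controlled coefficient and the adjusted one excludes zero, the sign of the result survives this particular departure from the assumptions. A larger assumed `r2_max` or `delta` widens the interval.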

Pre-trends

In any DiD, demand to see the pre-treatment event-study coefficients. If they're trending, the parallel-trends assumption is in doubt and the headline coefficient is likely contaminated by trend continuation. Flat pre-trends are reassuring but not proof, because the assumption concerns the unobserved post-treatment counterfactual. Many published DiD papers from before 2018 fall here, so re-examine them with the modern toolkit.
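
A sketch of what the event-study check looks like, assuming the simplest possible setting: one treated group, one control group, and treatment starting at event time 0. In that two-group case the event-study coefficients are the treated-minus-control gap in each period, measured relative to the gap in the reference period. The simulated data and all parameter values are hypothetical.

```python
import random

def event_study(outcomes, reference=-1):
    """outcomes maps event time -> (treated values, control values).

    Returns the treated-control gap in each period minus the gap in the
    reference period, so the reference coefficient is zero by construction.
    """
    gap = {t: sum(tr) / len(tr) - sum(co) / len(co)
           for t, (tr, co) in outcomes.items()}
    return {t: gap[t] - gap[reference] for t in sorted(gap)}

def simulate(differential_trend, effect=1.0, n=2000, seed=0):
    """Repeated cross-sections for event times -4..3; true effect starts at t = 0."""
    rng = random.Random(seed)
    data = {}
    for t in range(-4, 4):
        treated = [differential_trend * t + (effect if t >= 0 else 0.0)
                   + rng.gauss(0, 1) for _ in range(n)]
        control = [rng.gauss(0, 1) for _ in range(n)]
        data[t] = (treated, control)
    return event_study(data)

clean = simulate(differential_trend=0.0)     # parallel trends hold
trending = simulate(differential_trend=0.3)  # treated group drifts upward

for t in clean:
    print(f"t={t:+d}  parallel: {clean[t]:+.2f}  drifting: {trending[t]:+.2f}")
```

In the drifting case the pre-period coefficients slope upward toward zero, and the post-period coefficients overstate the true effect of 1.0 because they include the continued drift.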

Weak instruments

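First-stage F-statistic must be reported. Below 10 (the Staiger-Stock rule of thumb, later refined by the Stock-Yogo critical values), the IV estimate is biased toward OLS and its conventional SEs and confidence intervals can be badly misleading. Many famous IV papers have weak first stages by modern standards (Card-style proximity instruments are a common example), and Bartik shift-share designs raise further identification and inference problems unless adjustments such as those of Borusyak, Hull and Jaravel are applied.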
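
A minimal sketch of the first-stage check for the just-identified case (one instrument, one endogenous regressor), where the first-stage F is the squared t-statistic on the instrument. The simulated data, the function names and the two first-stage coefficients are hypothetical, and the F-statistic uses conventional rather than robust standard errors.

```python
import random

def demean(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def first_stage_f(z, x):
    """F-statistic on z in a regression of x on z with an intercept."""
    zd, xd = demean(z), demean(x)
    szz = sum(a * a for a in zd)
    pi_hat = sum(a * b for a, b in zip(zd, xd)) / szz
    rss = sum((b - pi_hat * a) ** 2 for a, b in zip(zd, xd))
    var_pi = rss / (len(z) - 2) / szz
    return pi_hat ** 2 / var_pi  # with one instrument, F equals t squared

def iv_estimate(z, x, y):
    """Just-identified IV estimate: cov(z, y) / cov(z, x)."""
    zd = demean(z)
    return (sum(a * b for a, b in zip(zd, demean(y)))
            / sum(a * b for a, b in zip(zd, demean(x))))

def simulate(first_stage_coef, n=1000, beta=1.0, seed=0):
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)
        u = rng.gauss(0, 1)  # unobserved confounder, enters both x and y
        xi = first_stage_coef * zi + u + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(beta * xi + u + rng.gauss(0, 1))
    return first_stage_f(z, x), iv_estimate(z, x, y)

f_strong, iv_strong = simulate(first_stage_coef=0.5)
f_weak, iv_weak = simulate(first_stage_coef=0.02)
print(f"strong instrument: F={f_strong:.1f}  IV estimate={iv_strong:.2f}")
print(f"weak instrument:   F={f_weak:.1f}  IV estimate={iv_weak:.2f}")
```

The true coefficient is 1.0 in both runs. With a weak first stage the denominator of the IV ratio is close to zero, so the estimate can land almost anywhere, which is why a missing first-stage F is itself a warning sign.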

Selective reporting

Five specifications shown, four with the desired result, the fifth tucked in a footnote. Pre-analysis plans (AEA RCT registry) help. Lacking those, demand a wide range of specifications shown side by side, not buried.
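
The arithmetic behind this concern can be sketched as follows, treating the five specifications as five independent tests at the 5% level (a simplifying assumption, since specifications run on the same data are correlated). The p-values are hypothetical, and Holm's step-down procedure is used here as one standard multiple-comparisons correction among several.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

# Chance of at least one "significant" result among five independent null tests
family_wise_error = 1 - 0.95 ** 5

reported = [0.01, 0.04, 0.03, 0.005, 0.20]  # hypothetical p-values, one per specification
adjusted = holm_adjust(reported)
print(f"family-wise error rate: {family_wise_error:.3f}")
print("adjusted p-values:", [round(p, 3) for p in adjusted])
```

After adjustment, two of the four nominally significant results no longer clear the 5% threshold. That shift is the information lost when specifications are chosen after seeing the results.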

Magnitude implausibility

If a paper claims minimum-wage hikes raise employment by 30% or that each additional year of schooling raises wages by 25%, the magnitude is implausible before any econometric scrutiny. Most real economic effects are small (a few percent). Papers claiming huge effects are usually picking up confounding.

What to do when you suspect a paper

Read the data section first. Where do the data come from? How was the sample selected? What's missing?
Read the identification section second. What's the source of variation? What threats to identification does the paper acknowledge, and how does it address each?
Look at the descriptive statistics. Do means and variances pass plausibility checks? Are sample sizes reasonable for the claims?
Find the robustness section. What alternative specifications were tried? Did the headline coefficient change?
Replicate, if possible. Many top journals now require code and data to be posted. Run it. Modify it. See if the result is fragile.

What to do when YOU produce empirical work

Pre-register if you can
Explain identification in plain English in the abstract
Show pre-trends, placebo tests, falsification checks
Cluster SEs at the meaningful level; report the rationale
Show results across multiple specifications, not just the favourite
Report magnitudes alongside p-values; interpret economic significance
Quantify how robust the conclusion is to plausible departures from key assumptions

The Andrew Gelman test

After reading any empirical claim, ask: 'what would I have written if the result had been the opposite?' If you can write a plausible story either way, the design isn't doing the identification work — the storytelling is. Look for designs whose predictions would have been falsified by other outcomes.

Living with uncertainty

Empirical economics is harder than physics. Causal claims about complex social systems will always carry residual uncertainty. The ambition is not to eliminate it but to make it explicit, bounded, and testable. The best empirical economists wear their uncertainty publicly — quantifying what they know, what they don't, and what assumption-revisions would change the conclusion.

Read papers with the question 'what would I need to believe for this to be wrong?' rather than 'is this right?'. The first prompts you to identify assumptions; the second invites you to defer to the authors' confidence. The first is the right disposition for a serious analyst.

Exercise

You're reading a paper that claims free school meals raise test scores by 0.3 standard deviations. The identification is OLS with controls for income and household size, on a cross-section of 50,000 students. List three specific concerns and what you'd want the paper to show.