Module 06 of 12 · 55 min read · Intermediate

Endogeneity — the central problem

Omitted variables, simultaneity, measurement error, and selection. The four ways your coefficient ends up wrong.

Endogeneity is the central problem in applied econometrics. It is what makes regressions on observational data hard to interpret. Every named method later in the course is a strategy for dealing with it.

What endogeneity is

In the model y = α + βx + u, a regressor x is endogenous if it is correlated with the error term: Cov(x, u) ≠ 0. When that happens, OLS is both biased and inconsistent, so adding more data does not fix it.

plim β̂_OLS = β + Cov(x, u) / Var(x)

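If x and u move together (Cov(x, u) > 0), OLS is biased upward. If they move in opposite directions, it is biased downward. Because the bias term is a population quantity, it does not shrink as observations are added; a larger sample only estimates the wrong number more precisely.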
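The formula above can be checked with a small simulation. This is a minimal sketch, assuming a made-up data-generating process: β = 2, Var(x) = 1 and Cov(x, u) = 0.5, so the formula predicts that the OLS slope converges to 2.5 rather than 2. The function names `ols_slope` and `simulate_endogenous` are illustrative, not from the course.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def simulate_endogenous(n, beta=2.0, rho=0.5, seed=0):
    """Draw (x, u) with Var(x) = Var(u) = 1 and Cov(x, u) = rho; return x and y."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    x, u = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    y = beta * x + u  # true slope is beta, but x is correlated with u
    return x, y

# The estimate settles near beta + Cov(x, u) / Var(x), not near beta.
for n in (100, 10_000, 1_000_000):
    x, y = simulate_endogenous(n)
    print(f"n = {n:>9,}: slope = {ols_slope(x, y):.3f}")
```

As n grows the estimate becomes more stable, but it stabilises around the biased value, which is what inconsistency means in practice.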

The four sources of endogeneity

1. Omitted variables

An omitted variable z that affects y and is correlated with x ends up in the error term, so x's coefficient absorbs part of z's effect. The sign of the bias is the sign of z's effect on y times the sign of its correlation with x. Wage on schooling without controlling for ability: ability raises both schooling and wages, so β̂_OLS overstates the schooling effect.
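The schooling example can be sketched in a simulation. All numbers here are hypothetical: a true return of 0.08 log points per year of schooling, an ability effect of 0.10, and ability shifting schooling one-for-one. Under those assumptions the short regression should converge to 0.08 + 0.10 × Cov(ability, schooling) / Var(schooling) = 0.08 + 0.10 × 1/2 = 0.13.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients [intercept, slope_1, slope_2, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical data-generating process (assumed for illustration only).
ability = rng.normal(size=n)
schooling = 12 + ability + rng.normal(size=n)  # ability raises schooling
log_wage = 0.08 * schooling + 0.10 * ability + rng.normal(scale=0.3, size=n)

beta_short = ols(log_wage, schooling)[1]          # ability omitted
beta_long = ols(log_wage, schooling, ability)[1]  # ability controlled for

print(f"short regression: {beta_short:.3f}")
print(f"long regression:  {beta_long:.3f}")
```

In real data ability is not observed, which is exactly why the long regression is unavailable and other strategies are needed.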

2. Simultaneity / reverse causation

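y also causes x, so the two are determined jointly. Price and quantity are the textbook case: both come from the intersection of the supply and demand curves, so a regression of quantity on price recovers a blend of the two slopes rather than either elasticity. Separating them requires variation that shifts one curve but not the other. This is the classic identification problem of econometrics, and the reason instrumental variables exist.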
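A minimal supply-and-demand sketch illustrates the problem. The slopes, intercepts and the cost shifter `z` are all assumptions made for this example: demand has slope -1, supply has slope +0.5, and `z` moves the supply curve only. The last lines use `z` as an instrument through the simple ratio Cov(q, z) / Cov(p, z), as a preview of the idea rather than a full treatment.

```python
import numpy as np

def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]

rng = np.random.default_rng(2)
n = 200_000
b_d, b_s, g = 1.0, 0.5, 1.0  # hypothetical demand slope, supply slope, cost effect

u_d = rng.normal(size=n)  # demand shock
u_s = rng.normal(size=n)  # supply shock
z = rng.normal(size=n)    # cost shifter: enters supply only

# Demand: q = 10 - b_d * p + u_d
# Supply: q =  2 + b_s * p - g * z + u_s
# Setting demand equal to supply gives the equilibrium price.
p = (10 - 2 + g * z + u_d - u_s) / (b_d + b_s)
q = 10 - b_d * p + u_d

ols_slope = cov(q, p) / cov(p, p)  # mixes both curves
iv_slope = cov(q, z) / cov(p, z)   # uses only supply-driven price variation

print(f"OLS slope of q on p: {ols_slope:.3f}")
print(f"IV slope using z:    {iv_slope:.3f}")
print(f"true demand slope:   {-b_d:.3f}")
```

With these assumed variances the OLS slope lands between the two structural slopes, while the ratio using `z` tracks the demand slope because `z` is excluded from the demand equation by construction.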

3. Measurement error in x

If x is measured with classical noise (noise independent of both the true x and the error u), the regression coefficient is biased toward zero. This 'attenuation bias' is one reason self-reported income measures tend to show weaker associations than administrative data.
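Under classical measurement error the size of the attenuation has a closed form: plim β̂ = β × Var(x*) / (Var(x*) + Var(e)), where x* is the true regressor and e is the noise. The sketch below assumes β = 2 and equal variances for x* and e, so the estimate should shrink by half, to about 1. The variable names are illustrative.

```python
import numpy as np

def slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

rng = np.random.default_rng(3)
n = 200_000
beta = 2.0

x_true = rng.normal(size=n)              # Var(x*) = 1
y = beta * x_true + rng.normal(size=n)
x_noisy = x_true + rng.normal(size=n)    # classical noise with Var(e) = 1

reliability = 1.0 / (1.0 + 1.0)          # Var(x*) / (Var(x*) + Var(e))

print(f"slope using true x:  {slope(x_true, y):.3f}")
print(f"slope using noisy x: {slope(x_noisy, y):.3f}")
print(f"predicted by formula: {beta * reliability:.3f}")
```

The result applies to noise in x. Classical noise in y alone leaves the slope consistent and only inflates its standard error.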

4. Selection

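The sample isn't random: whether a unit is observed depends on the outcome. Estimating the returns to college using only people who earn high wages, for example, conditions on the outcome and typically shrinks the estimated gap, because low-earning non-graduates are dropped disproportionately. The Heckman two-step model and Lee bounds give partial corrections; the cleaner answer is usually a different identification strategy.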
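The college example can be made concrete with a truncated sample. This is a sketch under invented numbers: college is assigned at random with a true wage premium of 5, and the "selected" sample keeps only people earning more than 25. Because college is randomised here, any gap between the two estimates comes from selection alone.

```python
import numpy as np

def mean_gap(wage, college):
    """Difference in mean wages between graduates and non-graduates."""
    return wage[college == 1].mean() - wage[college == 0].mean()

rng = np.random.default_rng(4)
n = 200_000

# Hypothetical population: random college assignment, true premium of 5.
college = rng.integers(0, 2, size=n)
wage = 20 + 5 * college + rng.normal(scale=5, size=n)

gap_full = mean_gap(wage, college)

# Selection on the outcome: only high earners are observed.
keep = wage > 25
gap_selected = mean_gap(wage[keep], college[keep])

print(f"full sample gap:     {gap_full:.2f}")
print(f"selected sample gap: {gap_selected:.2f}")
```

The cutoff removes most non-graduates but only about half of graduates, so the surviving non-graduates are unusually lucky draws and the measured gap collapses.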

How to recognise endogeneity

  • Ask: 'what would change x without changing u directly?' If you can't think of anything, you have a problem
  • Compare your point estimate to plausible bounds. Implausibly large coefficients often signal endogeneity
  • Compare cross-section estimates to panel within-estimates. If they differ wildly, time-invariant unobservables are probably biasing the cross-section estimate, since the within-estimator differences them out
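The third check in the list above can be sketched on simulated panel data. The setup is an assumption for illustration: each unit has a fixed unobservable `alpha` that raises both x and y, and the true slope is 1. Pooled OLS should then converge to 1 + Cov(x, alpha) / Var(x) = 1.5, while demeaning within each unit removes `alpha` and recovers 1.

```python
import numpy as np

def slope(x, y):
    """Simple-regression slope of y on x for 1-D arrays."""
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / (xd ** 2).sum()

rng = np.random.default_rng(5)
n_units, n_periods, beta = 5_000, 5, 1.0

# Hypothetical panel: rows are units, columns are time periods.
alpha = rng.normal(size=(n_units, 1))  # time-invariant unobservable
x = alpha + rng.normal(size=(n_units, n_periods))
y = beta * x + alpha + rng.normal(size=(n_units, n_periods))

# Pooled (cross-section style) estimate ignores alpha.
pooled = slope(x.ravel(), y.ravel())

# Within estimate: subtract each unit's own mean, which removes alpha.
x_within = x - x.mean(axis=1, keepdims=True)
y_within = y - y.mean(axis=1, keepdims=True)
within = slope(x_within.ravel(), y_within.ravel())

print(f"pooled OLS: {pooled:.3f}")
print(f"within:     {within:.3f}")
```

A large gap like this one points to time-invariant confounding, though in real data measurement error can also pull the within-estimate toward zero, so the comparison is a diagnostic rather than proof.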

The clean question

Whenever someone shows you a regression and claims a causal effect, ask: 'what's the source of variation in x that you're using, and is it plausibly exogenous?' If the answer is 'whatever variation is in the data', you should be very suspicious of the causal claim.

What to do about it

  • Find a credible instrument (Module 7)
  • Exploit a natural experiment via diff-in-diff (Module 8)
  • Use panel fixed effects to soak up time-invariant unobservables (Module 9)
  • Run a randomised controlled trial — the gold standard, expensive but cleanest
  • Bound the bias under plausible assumptions (Oster 2019, Conley-Hansen-Rossi)

Exercise

You want to estimate the returns to a paid online course. You regress income one year later on whether someone took the course (1/0) and find β̂ = $5,000. List two reasons this estimate is likely biased, and say in which direction each pushes it.

LeadAfrik Public Economics Hub