Endogeneity — the central problem — Econometrics Module 6

Endogeneity is the central problem in applied econometrics. It is what makes regressions on observational data hard to interpret. Every named method later in the course is a strategy for dealing with it.

What endogeneity is

A regressor is endogenous if it is correlated with the error term: Cov(x, u) ≠ 0. When that happens, OLS is biased AND inconsistent — adding more data does not fix it.

plim β̂_OLS = β + Cov(x, u) / Var(x)

If Cov(x, u) > 0, OLS is biased upward: β̂ converges to a value larger than β. If Cov(x, u) < 0, it is biased downward. Because the gap appears in the probability limit itself, it does not shrink as the number of observations grows.
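The formula follows from substituting y = βx + u into the OLS slope Cov(x, y) / Var(x). A minimal simulation sketch of the point that more data does not help is below. The values β = 1 and Cov(x, u) = 0.5, the data-generating process, and the function names are all illustrative assumptions, not part of the module.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (intercept included)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def simulate_slope(n, beta=1.0, rho=0.5, seed=0):
    """Draw n observations with Cov(x, u) = rho and return the OLS slope."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)               # error term, Var(u) = 1
    x = rho * u + rng.normal(size=n)     # Cov(x, u) = rho, Var(x) = rho**2 + 1
    y = beta * x + u
    return ols_slope(x, y)

beta, rho = 1.0, 0.5                     # illustrative values (assumed)
plim = beta + rho / (rho**2 + 1)         # beta + Cov(x, u) / Var(x)

for n in (100, 10_000, 1_000_000):
    est = simulate_slope(n, beta, rho)
    print(f"n = {n:>9,}: slope = {est:.3f}  (plim = {plim:.3f})")
```

As n grows the estimate settles on the plim, not on β: the sampling noise vanishes but the bias term stays.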

The four sources of endogeneity

1. Omitted variables

An omitted variable z that affects y and is correlated with x ends up in the error term, dragging x's coefficient with it. Wage on schooling without controlling for ability: ability raises both schooling and wages, so β̂_OLS overstates the schooling effect.
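In the plim formula, omitting z puts γz into u, so the bias term becomes γ · Cov(x, z) / Var(x), where γ is the effect of z on y. The sketch below checks this on simulated data. The coefficients (a 0.10 return to schooling, a 0.30 effect of ability) and the variable names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta, gamma = 0.10, 0.30     # assumed true effects of schooling and ability

ability = rng.normal(size=n)
schooling = 0.5 * ability + rng.normal(size=n)   # Cov = 0.5, Var(schooling) = 1.25
log_wage = beta * schooling + gamma * ability + rng.normal(size=n)

def ols(y, *regressors):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

short_beta = ols(log_wage, schooling)[1]            # ability omitted
long_beta = ols(log_wage, schooling, ability)[1]    # ability controlled for

# beta + gamma * Cov(schooling, ability) / Var(schooling)
predicted_short = beta + gamma * 0.5 / 1.25

print(f"short: {short_beta:.3f}  predicted: {predicted_short:.3f}  long: {long_beta:.3f}")
```

In real data ability is unobserved, so the second regression is not available; the simulation only shows what the short regression picks up.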

2. Simultaneity / reverse causation

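y also causes x. Price and quantity are determined jointly by demand and supply, so a regression of quantity on price traces out neither curve: its slope is a mixture of the demand and supply slopes. To recover either one you need variation that shifts one curve while leaving the other in place. This is the classic identification problem of econometrics, and the reason instrumental variables exist.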
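A small simulated market makes the problem concrete. The sketch assumes linear demand and supply with independent unit-variance shocks; the slopes (-1 for demand, +0.5 for supply) and all names are hypothetical. Under these assumptions the OLS slope converges to a variance-weighted average of the two slopes, here (0.5 - 1) / 2 = -0.25.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
b_d, b_s = 1.0, 0.5            # demand slope is -b_d, supply slope is +b_s (assumed)

u_d = rng.normal(size=n)       # demand shock
u_s = rng.normal(size=n)       # supply shock

# Market clearing: -b_d * p + u_d = b_s * p + u_s
p = (u_d - u_s) / (b_d + b_s)
q = b_s * p + u_s              # equilibrium quantity (same value on both curves)

ols_slope = np.cov(p, q)[0, 1] / np.var(p, ddof=1)

# With equal shock variances the plim is the simple average of the two slopes.
mixture = (b_s - b_d) / 2

print(f"OLS slope: {ols_slope:.3f}, mixture: {mixture:.3f}, "
      f"demand: {-b_d}, supply: {b_s}")
```

The estimate matches neither structural slope, and which one it sits closer to depends only on the relative variance of the two shocks.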

3. Measurement error in x

If x is measured with classical noise (noise independent of the true x and of u), the coefficient in a single-regressor model is biased toward zero. This 'attenuation bias' is one reason self-reported income measures often generate weaker associations than administrative data do.
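With classical noise the OLS slope converges to β · Var(x*) / (Var(x*) + Var(e)), where x* is the true regressor and e the noise. The sketch below assumes equal signal and noise variances, so the estimate should be about half the true β; the numbers and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 200_000, 2.0

x_true = rng.normal(size=n)                 # true regressor, Var = 1
noise = rng.normal(size=n)                  # classical measurement error, Var = 1
x_obs = x_true + noise                      # what the analyst actually sees
y = beta * x_true + rng.normal(size=n)

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

reliability = 1.0 / (1.0 + 1.0)             # Var(x*) / (Var(x*) + Var(e))
slope_true = slope(x_true, y)
slope_obs = slope(x_obs, y)

print(f"true x: {slope_true:.3f}, noisy x: {slope_obs:.3f}, "
      f"beta * reliability: {beta * reliability:.3f}")
```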

4. Selection

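The sample isn't random: whether a unit is observed depends on the outcome. Estimating returns to college using only people who earn high wages selects on y, which biases the estimated return, typically toward zero, because selection induces a negative correlation between x and u in the retained sample. The Heckman two-step model and Lee bounds give partial corrections; the cleaner answer is usually a different identification strategy.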
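Selection on the outcome can be sketched by simulating a full population and then keeping only observations above an outcome cutoff. The true slope of 1, the cutoff y > 1, and the variable names are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta = 200_000, 1.0

x = rng.normal(size=n)
y = beta * x + rng.normal(size=n)

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

full_slope = slope(x, y)

keep = y > 1.0                 # assumed rule: only high outcomes are observed
selected_slope = slope(x[keep], y[keep])

print(f"full sample: {full_slope:.3f}, selected sample: {selected_slope:.3f}, "
      f"share kept: {keep.mean():.2f}")
```

Among low-x units, only those with unusually large errors clear the cutoff, so x and the error are negatively correlated in the retained sample and the slope shrinks.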

How to recognise endogeneity

Ask: 'what would change x without changing u directly?' If you can't think of anything, you have a problem.
Compare your point estimate to plausible bounds. Implausibly large coefficients often signal endogeneity.
Compare cross-sectional estimates to panel within-estimates. If they differ wildly, a likely culprit is time-invariant unobservables biasing the cross-sectional estimate.
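The last check can be sketched on a simulated panel in which a unit-level unobservable a_i affects both x and y. The panel dimensions, the true slope of 1, and the names are assumptions for illustration; under this setup the pooled slope should sit near 1.5 and the within slope near 1.

```python
import numpy as np

rng = np.random.default_rng(5)
n_units, n_periods, beta = 5_000, 5, 1.0

a = rng.normal(size=(n_units, 1))                    # time-invariant unobservable
x = a + rng.normal(size=(n_units, n_periods))        # x is correlated with a
y = beta * x + a + rng.normal(size=(n_units, n_periods))

def slope(x, y):
    """Bivariate OLS slope after flattening the panel."""
    x, y = x.ravel(), y.ravel()
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Pooled regression ignores a, so it inherits Cov(x, a) / Var(x) = 1 / 2.
pooled = slope(x, y)

# Within transformation: subtract each unit's own mean, which removes a.
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
within = slope(x_dm, y_dm)

print(f"pooled: {pooled:.3f}, within: {within:.3f}")
```

A gap of this kind points to a time-invariant confounder. It says nothing about unobservables that vary over time, which the within estimator does not remove.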

The clean question

Whenever someone shows you a regression and claims a causal effect, ask: 'what's the source of variation in x that you're using, and is it plausibly exogenous?' If the answer is 'whatever variation is in the data', you should be very suspicious of the causal claim.

What to do about it

Find a credible instrument (Module 7)
Exploit a natural experiment via diff-in-diff (Module 8)
Use panel fixed effects to soak up time-invariant unobservables (Module 9)
Run a randomised controlled trial — the gold standard, expensive but cleanest
Bound the bias under plausible assumptions (Oster 2019; Conley, Hansen and Rossi 2012)

Exercise

You want to estimate the returns to a paid online course. You regress income one year later on a 1/0 indicator for whether someone took the course and find β̂ = $5,000. List two reasons this estimate is likely biased, and give the direction of each bias.