Module 02 of 12 · 60 min read · Intermediate

The linear regression — derivation and intuition

OLS as the line that minimises the sum of squared residuals. Minimisation, projection, and method-of-moments views. When does the line mean something?


The linear regression model says one variable is approximately a linear function of others, plus an error term. It is the workhorse of applied empirical economics, the model that nearly every causal study ultimately reduces to, and the foundation that every subsequent module in this course builds on. Get the variables and their interpretations clear here and everything that follows becomes routine.

math
yᵢ = β₀ + β₁ x₁ᵢ + β₂ x₂ᵢ + ... + βₖ xₖᵢ + uᵢ

Variable glossary — every symbol explained

  • yᵢ — the outcome (dependent variable, regressand) for observation i. The thing we are trying to explain. In wage regressions, it's the wage; in growth regressions, it's GDP growth; in development regressions, it could be a binary outcome like whether a household uses a particular technology.
  • i — the observation index, running 1, 2, …, n. Each i is a unit — a person, a firm, a country, a country-year cell. n is the sample size; large samples give precise estimates, small samples give noisy ones.
  • x₁ᵢ, x₂ᵢ, …, xₖᵢ — the predictors (covariates, regressors, independent variables) for observation i. The variables we think might explain y. In a wage regression, these include years of schooling, experience, gender, region, industry. k is the number of regressors.
  • β₀ — the intercept. The expected value of y when every regressor equals zero. Often not economically meaningful in itself (the predicted wage of someone with zero years of schooling, zero experience, etc.) but mathematically required for the line to fit correctly. Reported in the units of y.
  • β₁, β₂, …, βₖ — the slope coefficients on each regressor. Each β_j is the expected change in y associated with a one-unit increase in x_j, holding all other regressors fixed. The 'holding fixed' qualification is the entire identification game — most of the rest of econometrics is about when that interpretation survives scrutiny.
  • uᵢ — the error term (also called the disturbance or residual at the population level). Everything systematic about yᵢ that the regressors do not capture, plus pure noise. The whole apparatus of OLS assumptions, identification strategies, and standard errors is about what we can say about uᵢ and how it relates to the regressors.

Why each variable matters in practice

y must be measured well. Random measurement error in y only inflates standard errors, but if the error in y is correlated with the regressors the estimates are biased. The x's drive identification: adding or removing regressors changes the meaning of every β you report. β₀ is rarely the focus, but its sign and magnitude provide a sanity check (negative predicted wages would flag a problem). The β coefficients are what you report and interpret, but their causal interpretation depends entirely on what is in u. The error term u is where every causal-inference concern lives: endogeneity, omitted variables, measurement error in the regressors, and simultaneity all show up as correlation between u and the regressors.

Three ways to think about OLS

1. Minimising squared residuals

Pick (β₀, β₁, ..., βₖ) to minimise the sum of squared deviations between actual and predicted y:

math
min Σᵢ (yᵢ − β₀ − β₁ x₁ᵢ − ... − βₖ xₖᵢ)²

Why squared? Absolute deviations are not differentiable at zero, whereas the squared criterion is smooth and has a closed-form minimiser. Squaring also punishes large deviations more heavily, which usually matches our preferences in economics.
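The minimisation can be checked numerically. The sketch below assumes simulated data (the true coefficients 1.0 and 2.0 are illustrative, not taken from the text) and uses NumPy's `np.linalg.lstsq` to find the minimiser. It then confirms that nudging the fitted coefficients in a few directions raises the sum of squared residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # simulated data; coefficients are assumptions

X = np.column_stack([np.ones(n), x])     # column of ones carries the intercept

def ssr(beta):
    """Sum of squared residuals for a candidate coefficient vector."""
    resid = y - X @ beta
    return float(resid @ resid)

# Least-squares solution: the beta that minimises ssr(beta)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Moving away from beta_hat in any direction should increase the criterion
shifts = [np.array([0.1, 0.0]), np.array([0.0, 0.1]), np.array([-0.1, 0.1])]
perturbed = [ssr(beta_hat + d) for d in shifts]

print("beta_hat:", beta_hat)
print("SSR at minimum:", ssr(beta_hat))
print("SSR after small shifts:", perturbed)
```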

2. Geometric: orthogonal projection

OLS projects the vector y onto the column space of the X matrix. The residuals û = y − ŷ are perpendicular to every column of X (each regressor and the constant), so X′û = 0. That is a deep result and the source of many algebraic identities in econometrics, including the Frisch-Waugh-Lovell theorem we cover in Module 5.
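The orthogonality claim is easy to verify on simulated data. In this sketch (data and coefficients are illustrative assumptions), the fitted residuals have zero inner product with every column of X, and the squared length of y splits into the squared lengths of the fitted values and the residuals, as it must for a right-angled projection.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # constant + two regressors
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)      # illustrative coefficients

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat          # projection of y onto the column space of X
resid = y - y_hat             # fitted residuals

# Inner product of the residuals with each column of X (zero up to rounding)
orth = X.T @ resid

# Pythagoras: ||y||^2 = ||y_hat||^2 + ||resid||^2
lhs = y @ y
rhs = y_hat @ y_hat + resid @ resid

print("X'resid:", orth)
print("||y||^2:", lhs, " ||y_hat||^2 + ||resid||^2:", rhs)
```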

3. Method of moments

Assume E[X·u] = 0 (regressors are uncorrelated with the error). Replacing the population expectation with its sample average, (1/n) Σᵢ xᵢ (yᵢ − xᵢ′β̂) = 0, gives the OLS normal equations. This view generalises naturally to GMM (Hansen, 1982), a framework that nests many modern empirical methods.
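A minimal sketch of the method-of-moments view, assuming simulated data in which the error is drawn independently of x so that the moment condition holds by construction. The helper `sample_moments` is a hypothetical name introduced here. Setting the sample moments to zero and solving gives the same coefficients as a least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
u = rng.normal(size=n)                 # independent of x, so E[x*u] = 0 holds
y = 3.0 - 1.5 * x + u                  # illustrative coefficients (assumption)
X = np.column_stack([np.ones(n), x])

def sample_moments(b):
    """Sample analogue of E[X*u]: (1/n) * X'(y - Xb)."""
    return X.T @ (y - X @ b) / n

# Setting the sample moments to zero gives the normal equations (X'X) b = X'y
b_mm = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solution for comparison
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print("method of moments:", b_mm)
print("least squares:    ", b_ls)
print("sample moments at b_mm:", sample_moments(b_mm))
```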

The simple regression formula (one regressor)

math
β̂₁ = Cov(x, y) / Var(x)
β̂₀ = ȳ − β̂₁ · x̄
  • β̂₁ — the OLS slope estimate, the sample-based estimate of the true population β₁. The hat symbol signals 'this is an estimator from data, not the unknown population parameter'.
  • Cov(x, y) — the sample covariance between x and y. Captures how x and y co-move; positive when they tend to move together, negative when one rises as the other falls. Computed as (1/n) × Σ (xᵢ − x̄)(yᵢ − ȳ).
  • Var(x) — the sample variance of x. Captures the spread of x in the data. Computed as (1/n) × Σ (xᵢ − x̄)². Large variance in x means the regression has 'leverage' to identify the slope; tiny variance makes the slope estimate noisy.
  • β̂₀ — the OLS intercept estimate. Forces the regression line to pass exactly through the point (x̄, ȳ) — the mean of both variables.
  • ȳ — the sample mean of y, computed as (1/n) × Σ yᵢ.
  • x̄ — the sample mean of x, computed as (1/n) × Σ xᵢ.

The slope is the covariance between x and y scaled by the variance of x — a normalisation that puts the answer in the right units (per unit of x). The intercept then anchors the line to pass through the joint means. The formula's elegance hides considerable depth: every assumption about exogeneity, homoskedasticity, and independence enters when you ask whether β̂₁ is unbiased, consistent, or efficient for the true β₁.
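The formula can be computed by hand and compared with a library fit. The sketch below uses simulated data (an assumption, not course data) and the 1/n definitions given above, then checks the result against `np.polyfit`. It also shows that using 1/(n − 1) in both covariance and variance gives the same slope, because the scaling cancels in the ratio.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0, 16, size=n)               # hypothetical regressor
y = 5.0 + 0.8 * x + rng.normal(size=n)       # illustrative coefficients

x_bar, y_bar = x.mean(), y.mean()
cov_xy = np.mean((x - x_bar) * (y - y_bar))  # (1/n) * sum of cross-deviations
var_x = np.mean((x - x_bar) ** 2)            # (1/n) * sum of squared deviations

b1 = cov_xy / var_x                          # slope
b0 = y_bar - b1 * x_bar                      # intercept

# Library fit of a degree-1 polynomial: returns [slope, intercept]
slope_np, intercept_np = np.polyfit(x, y, deg=1)

# Same slope with the 1/(n-1) convention (np.cov default, np.var with ddof=1)
b1_ddof1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print("by hand:", b0, b1)
print("polyfit:", intercept_np, slope_np)
print("line at x_bar:", b0 + b1 * x_bar, " y_bar:", y_bar)
```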

What does β₁ mean?

β₁ is the predicted change in y for a one-unit increase in x, holding everything else fixed. The 'holding everything else fixed' part is the entire game — and most of the rest of the course is about when that interpretation survives scrutiny.
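To see why the 'holding fixed' clause matters, consider a simulated example. Everything here is an assumption made for illustration: the variable names `ability`, `schooling`, and `wage`, the coefficients, and the data-generating process. The true effect of schooling is set to 1,000, but ability raises both schooling and wages, so the coefficient on schooling changes depending on whether ability is held fixed.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
ability = rng.normal(size=n)                              # hypothetical confounder
schooling = 12 + 2 * ability + rng.normal(size=n)         # ability raises schooling
wage = 1000 * schooling + 3000 * ability + rng.normal(scale=2000, size=n)

def ols(y, *cols):
    """OLS coefficients with an intercept; columns are the regressors."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

long_fit = ols(wage, schooling, ability)   # ability held fixed
short_fit = ols(wage, schooling)           # ability left in the error term

print("schooling coefficient, ability included:", long_fit[1])
print("schooling coefficient, ability omitted: ", short_fit[1])
```

In this design the short regression overstates the schooling coefficient, because part of the effect of ability is attributed to schooling. Both numbers are valid descriptions of the data; only the first matches the effect built into the simulation.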

Goodness of fit: R²

math
R² = 1 − (SSR / SST) = ESS / SST

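Total sum of squares (SST) = explained (ESS) + residual (SSR); this decomposition holds when the regression includes an intercept. R² is the share of the variation in y explained by the regression. Adding a regressor can never lower R², even if the regressor is pure noise; adjusted R² penalises the extra parameters.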
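A short sketch of these properties on simulated data (an illustrative assumption). The hypothetical helper `fit_stats` computes R² both ways, plus adjusted R² using the standard formula 1 − (1 − R²)(n − 1)/(n − k − 1), and the example then adds a regressor that is unrelated to y by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)      # illustrative coefficients
noise = rng.normal(size=n)                  # unrelated to y by construction

def fit_stats(X, y):
    """Return (R^2, adjusted R^2, ESS/SST) for an OLS fit; X includes a constant."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    ssr = np.sum((y - y_hat) ** 2)
    ess = np.sum((y_hat - y.mean()) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    k = X.shape[1] - 1                      # regressors, excluding the constant
    r2 = 1 - ssr / sst
    adj_r2 = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj_r2, ess / sst

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([np.ones(n), x, noise])

r2_s, adj_s, share_s = fit_stats(X_small, y)
r2_b, adj_b, share_b = fit_stats(X_big, y)

print("without noise regressor: R2 =", r2_s, " adjusted =", adj_s)
print("with noise regressor:    R2 =", r2_b, " adjusted =", adj_b)
```

R² cannot fall when the noise column is added, while adjusted R² may move in either direction depending on the draw.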

R² is not what you think it is

High R² is not evidence of causation. Low R² is not evidence the regression is useless. Cross-section R²s of 0.05 are common in studies that produce credible causal estimates. R² answers a prediction question, not a causal one.

Exercise

If you regress wage on years of schooling and get β̂₁ = 1,200 (KSh per year), what does this number mean — and what does it not mean?

LeadAfrik Public Economics Hub