Skip to content
Module 06 of 1260 min readIntermediate

Least squares and QR decomposition

OLS as a projection. The normal equations. QR via Gram-Schmidt and why it beats the normal equations numerically.

50%

Listen along

Read “Least squares and QR decomposition” aloud

Plays in your browser using on-device text-to-speech — nothing leaves the page.

Least squares is the most-used algorithm in applied quantitative work. It solves the overdetermined system Ax ≈ b — more equations than unknowns — by finding the x that minimises ‖Ax - b‖². It appears in OLS regression, calibration of yield-curve models, kernel-density bandwidth selection, and dozens of other places.

The normal equations

The minimiser of ‖Ax - b‖² satisfies the first-order condition that the gradient vanishes. Differentiating and setting to zero:

math
Aᵀ(Ax - b) = 0
AᵀA x = Aᵀb
x̂ = (AᵀA)⁻¹ Aᵀb (normal equations)

Geometrically, the residual r = b - Ax̂ is orthogonal to every column of A — equivalently, x̂ is the projection of b onto C(A). This is the projection theorem in equation form.

Don't actually use the normal equations

The condition number of AᵀA is the square of the condition number of A. If A is mildly ill-conditioned (κ(A) = 1000), then AᵀA is severely ill-conditioned (κ(AᵀA) = 10⁶). Numerical libraries solve least-squares problems by QR or SVD, not by forming AᵀA explicitly. We do so in math; we don't in code.

QR decomposition

Any m×n matrix A with full column rank can be factored as A = QR, where Q is m×n with orthonormal columns (QᵀQ = I) and R is n×n upper-triangular. Substituting into the normal equations gives the numerically stable solution:

math
A = QR
AᵀA x = Aᵀb
RᵀR x = RᵀQᵀb
Rx = Qᵀb (since R is invertible, RᵀR x = RᵀQᵀb ⟹ Rx = Qᵀb)
x̂ = R⁻¹ Qᵀb (back-sub, O(n²))

QR factorisation costs O(mn²) using Householder reflections (the standard production algorithm). It is the workhorse behind numpy.linalg.lstsq, R's lm.fit, and the inside of every modern OLS routine.

Gram-Schmidt — orthogonalisation in plain English

Gram-Schmidt produces an orthonormal basis from any linearly-independent set. Apply it to the columns of A and you get the Q factor of QR directly. The geometric idea: at step k, subtract from column k its projection onto the span of the previous columns, then normalise.

Classical vs modified Gram-Schmidt

The textbook version (classical) is numerically unstable when columns are nearly parallel. Modified Gram-Schmidt — fixing one Q column at a time and subtracting its projection from all later columns — is much more stable and is what's used in numerical libraries (when QR is computed by Gram-Schmidt rather than Householder).

Coefficient covariance for OLS

Under the OLS assumptions (Econometrics Module 3), the covariance of β̂ is σ²(XᵀX)⁻¹. Standard errors are square roots of the diagonal. Heteroskedasticity-robust (White) covariance uses (XᵀX)⁻¹ Xᵀ diag(û²) X (XᵀX)⁻¹ — the famous sandwich estimator.

Exercise

You regress 252 days of stock returns on a constant and the market return. (1) Write the design matrix X. (2) Write the OLS normal equation. (3) The intercept is α, the slope is β (the asset's market beta). Express β̂ in terms of sample variances and covariances.

Loading progress…
LeadAfrikPublic Economics Hub