Least squares is the most-used algorithm in applied quantitative work. It solves the overdetermined system Ax ≈ b — more equations than unknowns — by finding the x that minimises ‖Ax - b‖². It appears in OLS regression, calibration of yield-curve models, kernel-density bandwidth selection, and dozens of other places.
The normal equations
The minimiser of ‖Ax - b‖² satisfies the first-order condition that the gradient vanishes. Differentiating and setting to zero:
Aᵀ(Ax - b) = 0AᵀA x = Aᵀbx̂ = (AᵀA)⁻¹ Aᵀb (normal equations)
Geometrically, the residual r = b - Ax̂ is orthogonal to every column of A — equivalently, x̂ is the projection of b onto C(A). This is the projection theorem in equation form.
Don't actually use the normal equations
The condition number of AᵀA is the square of the condition number of A. If A is mildly ill-conditioned (κ(A) = 1000), then AᵀA is severely ill-conditioned (κ(AᵀA) = 10⁶). Numerical libraries solve least-squares problems by QR or SVD, not by forming AᵀA explicitly. We do so in math; we don't in code.
QR decomposition
Any m×n matrix A with full column rank can be factored as A = QR, where Q is m×n with orthonormal columns (QᵀQ = I) and R is n×n upper-triangular. Substituting into the normal equations gives the numerically stable solution:
A = QRAᵀA x = AᵀbRᵀR x = RᵀQᵀbRx = Qᵀb (since R is invertible, RᵀR x = RᵀQᵀb ⟹ Rx = Qᵀb)x̂ = R⁻¹ Qᵀb (back-sub, O(n²))
QR factorisation costs O(mn²) using Householder reflections (the standard production algorithm). It is the workhorse behind numpy.linalg.lstsq, R's lm.fit, and the inside of every modern OLS routine.
Gram-Schmidt — orthogonalisation in plain English
Gram-Schmidt produces an orthonormal basis from any linearly-independent set. Apply it to the columns of A and you get the Q factor of QR directly. The geometric idea: at step k, subtract from column k its projection onto the span of the previous columns, then normalise.
Classical vs modified Gram-Schmidt
The textbook version (classical) is numerically unstable when columns are nearly parallel. Modified Gram-Schmidt — fixing one Q column at a time and subtracting its projection from all later columns — is much more stable and is what's used in numerical libraries (when QR is computed by Gram-Schmidt rather than Householder).
Coefficient covariance for OLS
Under the OLS assumptions (Econometrics Module 3), the covariance of β̂ is σ²(XᵀX)⁻¹. Standard errors are square roots of the diagonal. Heteroskedasticity-robust (White) covariance uses (XᵀX)⁻¹ Xᵀ diag(û²) X (XᵀX)⁻¹ — the famous sandwich estimator.
Exercise
You regress 252 days of stock returns on a constant and the market return. (1) Write the design matrix X. (2) Write the OLS normal equation. (3) The intercept is α, the slope is β (the asset's market beta). Express β̂ in terms of sample variances and covariances.