Regression is the statistical workhorse — fitting linear relationships between a target and predictors, then evaluating uncertainty in the fit. We covered OLS as projection (Linear Algebra Module 5-6); here we treat it statistically: the sampling distribution of β̂, the t- and F-machinery, and the R² caveats.
Setup
y = Xβ + u, u ~ N(0, σ² I)
β̂ = (XᵀX)⁻¹ Xᵀy
Sampling distribution of β̂
Conditional on X and under the Gaussian-error assumption above, β̂ is a linear function of y and is therefore normally distributed:
β̂ ~ N(β, σ² (XᵀX)⁻¹)
The standard error of β̂_j is the square root of the j-th diagonal element of σ²(XᵀX)⁻¹. Since σ² is unknown, we estimate it by s² = ûᵀû / (n - k), where û = y - Xβ̂ is the residual vector and k is the number of columns of X (intercept included).
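As a minimal sketch of these formulas, the NumPy snippet below computes β̂, s², and the classical standard errors on simulated data. The seed, sample size, and `beta_true` values are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, illustrative only
n, k = 200, 3                             # k counts the intercept column
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])    # hypothetical true coefficients
y = X @ beta_true + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y              # (X'X)^{-1} X'y
resid = y - X @ beta_hat                  # residual vector u-hat
s2 = resid @ resid / (n - k)              # estimate of sigma^2 with n - k degrees of freedom
se = np.sqrt(s2 * np.diag(XtX_inv))       # classical standard errors

print("beta_hat:", beta_hat)
print("s2:", s2)
print("SE:", se)
```

Forming the explicit inverse keeps the code close to the formula. In numerical work, `np.linalg.lstsq` or a QR decomposition is usually the more stable way to get β̂.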
t-statistic and F-statistic
t_j = β̂_j / SE(β̂_j) ~ t_{n-k} (under H₀: β_j = 0)
F = ((SSR_R - SSR_U) / q) / (SSR_U / (n - k)) ~ F_{q, n-k} (joint H₀)
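The t-statistic tests one coefficient at a time; the F-statistic tests q joint restrictions (e.g., 'all slopes equal zero') by comparing the restricted (SSR_R) and unrestricted (SSR_U) sums of squared residuals. In large samples, t_{n-k} converges to the standard normal and F_{q, n-k} converges to χ²_q / q.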
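The two statistics can be sketched in NumPy as follows, assuming simulated data in which the second slope is truly zero. All variable names and numbers here are illustrative. The last lines check the standard identity that the F-statistic for a single restriction equals the square of the corresponding t-statistic.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(1)            # fixed seed, illustrative only
n = 120
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 0.5 + 1.0 * x1 + 0.0 * x2 + rng.normal(size=n)

X_u = np.column_stack([np.ones(n), x1, x2])   # unrestricted model
X_r = np.ones((n, 1))                         # restricted: all slopes equal zero
k, q = X_u.shape[1], 2

ssr_u, ssr_r = ssr(X_u, y), ssr(X_r, y)
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))

# t-statistics for H0: beta_j = 0 in the unrestricted model
XtX_inv = np.linalg.inv(X_u.T @ X_u)
beta_hat = XtX_inv @ X_u.T @ y
s2 = ssr_u / (n - k)
t_stats = beta_hat / np.sqrt(s2 * np.diag(XtX_inv))

# Single restriction (drop x2 only): F equals t squared
X_r1 = np.column_stack([np.ones(n), x1])
F_single = (ssr(X_r1, y) - ssr_u) / (ssr_u / (n - k))

print("F (all slopes zero):", F)
print("t-statistics:", t_stats)
print("F_single vs t^2:", F_single, t_stats[2] ** 2)
```

Critical values and p-values would come from the t_{n-k} and F_{q, n-k} distributions, which a statistics library provides. They are left out to keep the sketch dependent on NumPy only.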
R² and adjusted R²
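R² = 1 - SSR / SST
Adj R² = 1 - (1 - R²)(n - 1)/(n - k)
Here k counts all columns of X, intercept included, matching the n - k degrees of freedom used for s². With p = k - 1 slope regressors, the denominator is the familiar n - p - 1.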
R² inflates with predictors
Adding any regressor, even pure noise, never decreases R². Adjusted R² penalises this. But neither is a measure of model quality: regressing a variable on itself gives R² = 1 and tells you nothing, while a regression of daily returns on lagged daily returns typically gives R² ≈ 0.01, which is excellent for finance. Judge R² relative to the domain, not against absolute thresholds.
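A small simulation makes the first point concrete. This is a sketch under stated assumptions: the data are simulated, and `r2_and_adj` is a hypothetical helper written for this illustration. The helper counts p slope regressors separately from the intercept, so its denominator n - p - 1 is the same number as n minus the column count of X.

```python
import numpy as np

def r2_and_adj(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit; X must contain an intercept column."""
    n, ncols = X.shape
    p = ncols - 1                                  # number of slope regressors
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ssr / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)     # n - p - 1 = n minus the column count
    return r2, adj

rng = np.random.default_rng(2)                     # fixed seed, illustrative only
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
noise = rng.normal(size=n)                         # unrelated to y by construction

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, noise])          # same model plus a pure-noise column

r2_small, adj_small = r2_and_adj(X_small, y)
r2_big, adj_big = r2_and_adj(X_big, y)
print(f"R2: {r2_small:.4f} -> {r2_big:.4f}")
print(f"adjusted R2: {adj_small:.4f} -> {adj_big:.4f}")
```

R² cannot fall when the noise column is added, because the larger model nests the smaller one. Adjusted R² may move in either direction depending on the draw.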
Robust standard errors
Under heteroskedasticity (Var(uᵢ | X) varies across i), the classical SE formula is wrong. White's heteroskedasticity-consistent (HC0) estimator replaces σ²(XᵀX)⁻¹ with the sandwich (XᵀX)⁻¹ Xᵀ diag(û²) X (XᵀX)⁻¹; the small-sample variants (HC1, HC2, HC3) rescale the squared residuals ûᵢ² upward to offset their downward bias. Use HC1 by default; HC3 if you have small samples or high-leverage observations.
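The sandwich can be sketched directly in NumPy. The example assumes simulated data whose error standard deviation grows with x. HC1 is implemented here as HC0 scaled by n / (n - k), the usual degrees-of-freedom correction. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)             # fixed seed, illustrative only
n = 500
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(scale=x, size=n)            # error sd proportional to x (heteroskedastic)
y = 1.0 + 2.0 * x + u

k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Meat of the sandwich: X' diag(u_hat^2) X, computed without building the n x n matrix
meat = X.T @ (X * resid[:, None] ** 2)
V_hc0 = XtX_inv @ meat @ XtX_inv
V_hc1 = V_hc0 * n / (n - k)                # HC1: degrees-of-freedom rescaling of HC0

se_classical = np.sqrt(resid @ resid / (n - k) * np.diag(XtX_inv))
se_hc1 = np.sqrt(np.diag(V_hc1))
print("classical SE:", se_classical)
print("HC1 SE:      ", se_hc1)
```

Multiplying the rows of X by ûᵢ² avoids allocating diag(û²), which matters once n is large.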
Clustered standard errors
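When observations are grouped (firm-year panels, household survey blocks), residuals within clusters are correlated. Cluster-robust SEs (Liang-Zeger 1986, Cameron-Miller 2015) replace the meat Xᵀ diag(û²) X with the sum over clusters g of X_gᵀ û_g û_gᵀ X_g, which allows arbitrary correlation within a cluster. Vital for panel data: failing to cluster can understate the standard errors substantially, and a factor of two or more is not unusual when both regressors and errors are positively correlated within clusters.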
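A minimal sketch of the cluster-robust sandwich follows. It assumes simulated data with a common shock inside each cluster and applies no small-sample correction. `cluster_vcov` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def cluster_vcov(X, resid, groups):
    """Cluster-robust covariance of the OLS estimator, no small-sample correction."""
    XtX_inv = np.linalg.inv(X.T @ X)
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        idx = groups == g
        score_g = X[idx].T @ resid[idx]    # X_g' u_hat_g, a length-k vector
        meat += np.outer(score_g, score_g) # equals X_g' u_hat_g u_hat_g' X_g
    return XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(4)             # fixed seed, illustrative only
G, m = 40, 10                              # hypothetical: 40 clusters of 10 observations
groups = np.repeat(np.arange(G), m)
n = G * m
x = rng.normal(size=G)[groups] + 0.5 * rng.normal(size=n)   # regressor correlated within cluster
u = rng.normal(size=G)[groups] + rng.normal(size=n)         # cluster shock plus idiosyncratic noise
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + u

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

V_cluster = cluster_vcov(X, resid, groups)
se_cluster = np.sqrt(np.diag(V_cluster))
k = X.shape[1]
se_classical = np.sqrt(resid @ resid / (n - k) * np.diag(np.linalg.inv(X.T @ X)))
print("classical SE:", se_classical)
print("cluster SE:  ", se_cluster)
```

When every observation is its own cluster, the function reduces to HC0, which is a convenient sanity check.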
OLS as MLE
Under the Gaussian-errors assumption, OLS β̂ is also the MLE. The connection: maximising the Gaussian log-likelihood is equivalent to minimising the sum of squared residuals. This is why OLS attains the Cramér-Rao bound and is asymptotically efficient under correct specification.
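The equivalence can be read off the log-likelihood. Written in LaTeX, with the same symbols as in the setup:

```latex
\ell(\beta, \sigma^2)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\,(y - X\beta)^\top (y - X\beta)
```

For any fixed σ², only the second term depends on β, and it is a negative multiple of the sum of squared residuals, so the maximiser in β is the OLS β̂. Maximising over σ² then gives ûᵀû / n, which divides by n and not by n - k, so the MLE of σ² is biased downward in finite samples, unlike s².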
Misspecification
- Omitted variable: β̂ for included regressors is biased whenever the omitted variable affects y and is correlated with them.
- Wrong functional form: fitting a line to a nonlinear y(X) leaves systematic patterns in the residuals.
- Heteroskedasticity: SEs wrong, β̂ still unbiased.
- Autocorrelation: SEs wrong (Newey-West fix), β̂ still unbiased as long as the regressors are strictly exogenous (which fails, for example, with a lagged dependent variable).
- Endogeneity: β̂ inconsistent. The big one — Econometrics Module 6.
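The omitted-variable case from the list above can be sketched with a short simulation. The coefficients 2.0 and 3.0 and the 0.8 dependence of x2 on x1 are made-up values for illustration. The short regression's slope should land near 2.0 + 3.0 × 0.8 = 4.4, not near 2.0.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(5)             # fixed seed, illustrative only
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)         # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

ones = np.ones(n)
b_full = ols(np.column_stack([ones, x1, x2]), y)   # correctly specified model
b_short = ols(np.column_stack([ones, x1]), y)      # x2 omitted
delta = ols(np.column_stack([ones, x1]), x2)       # auxiliary regression of x2 on x1

print("full-model slope on x1: ", b_full[1])
print("short-model slope on x1:", b_short[1])
print("bias formula:           ", b_full[1] + b_full[2] * delta[1])
```

The last line uses the in-sample identity that the short-regression slope equals the long-regression slope plus the omitted coefficient times the auxiliary slope. It holds exactly in any sample, not just on average.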
Exercise
You regress monthly stock returns on the market return (24 months). β̂ = 1.3, SE(β̂) = 0.2. (1) Is β̂ statistically different from 1 (the market beta)? (2) Compute a 95% CI for β. (3) The R² is 0.45. Interpret.