An estimator is a function of data that approximates an unknown parameter. The quality of an estimator is judged on bias (does its expectation hit the truth?), variance (how much does it jitter sample-to-sample?), and consistency (does it converge to the truth as n grows?). Maximum likelihood estimation (MLE) is the dominant approach because it is asymptotically efficient under regularity conditions.
Method of moments (MoM)
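Equate sample moments to population moments and solve for parameters. For a normal: set the sample mean X̄ equal to μ and the sample variance S² equal to σ², so μ̂ = X̄ and σ̂² = S². Easy and often used as a starting point for iterative methods. Less efficient than MLE in general, but it can be more robust to misspecification because it relies only on the chosen moments rather than the full density.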
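As a sketch of the "equate and solve" step in a case where the solution is less trivial than the normal, consider a gamma distribution with shape k and scale θ, whose mean is kθ and variance is kθ². The gamma example, the function name gamma_mom, and the simulated data are illustrative assumptions, not part of the notes above.

```python
import numpy as np

def gamma_mom(x):
    """Method-of-moments estimates (shape, scale) for a gamma sample.

    Solves mean = shape * scale and variance = shape * scale**2.
    """
    x = np.asarray(x, dtype=float)
    m = x.mean()   # first sample moment
    v = x.var()    # second central sample moment (divisor n)
    shape = m**2 / v
    scale = v / m
    return shape, scale

# Simulated data with hypothetical true values shape=2, scale=3.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=3.0, size=50_000)
shape_hat, scale_hat = gamma_mom(sample)
print(shape_hat, scale_hat)
```

The estimates should land near the simulated values; they would typically serve as starting points for a likelihood-based fit.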
Maximum likelihood
Given data x = (x₁, ..., xₙ) drawn i.i.d. from f(x; θ), the likelihood is L(θ) = Π f(xᵢ; θ). The log-likelihood is ℓ(θ) = Σ log f(xᵢ; θ). The MLE θ̂ maximises ℓ(θ).
θ̂ = argmax_θ ℓ(θ)
∂ℓ/∂θ |_{θ̂} = 0 (score equation, when smooth)
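When the score equation has no convenient closed form, it is usually solved numerically. The sketch below applies Newton-Raphson to the score of an exponential model with rate λ, chosen only because the closed-form answer 1/X̄ is available as a check. The exponential model, the function name exp_rate_mle_newton, and the default starting value are assumptions made for illustration.

```python
import numpy as np

def exp_rate_mle_newton(x, lam0=None, tol=1e-12, max_iter=100):
    """Solve the score equation for an exponential rate by Newton-Raphson.

    Log-likelihood: l(lam) = n*log(lam) - lam*sum(x).
    For this model Newton's iteration converges when 0 < lam0 < 2/mean(x).
    """
    x = np.asarray(x, dtype=float)
    n, s = x.size, x.sum()
    lam = 0.5 / x.mean() if lam0 is None else float(lam0)
    for _ in range(max_iter):
        score = n / lam - s        # first derivative of l
        hess = -n / lam**2         # second derivative of l
        lam_new = lam - score / hess
        if abs(lam_new - lam) < tol:
            return lam_new
        lam = lam_new
    return lam

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0 / 2.5, size=1000)  # NumPy uses scale = 1/rate
lam_hat = exp_rate_mle_newton(x)
print(lam_hat, 1.0 / x.mean())  # numerical root vs closed-form MLE
```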
Properties of MLE under regularity
- Consistency: θ̂ →p θ₀ (the true parameter).
- Asymptotic normality: √n (θ̂ - θ₀) →d N(0, I(θ₀)⁻¹), where I(θ₀) is the per-observation Fisher information matrix.
- Asymptotic efficiency: the asymptotic variance of the MLE equals the Cramér-Rao lower bound, so no regular estimator has smaller asymptotic variance.
- Invariance: if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g.
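The consistency and asymptotic-normality properties can be checked by simulation. The sketch below reuses the exponential rate model as an assumed example: its MLE is λ̂ = 1/X̄ and its per-observation information is I(λ) = 1/λ², so √n (λ̂ - λ₀) should have variance close to λ₀² for large n. The sample size, replication count, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
lam0, n, reps = 2.0, 400, 5000   # hypothetical true rate, sample size, replications

# Each row is one i.i.d. sample of size n; NumPy uses scale = 1/rate.
samples = rng.exponential(scale=1.0 / lam0, size=(reps, n))
lam_hat = 1.0 / samples.mean(axis=1)      # MLE of the rate in each replication

z = np.sqrt(n) * (lam_hat - lam0)         # centred and scaled estimation error
print("mean of z:", z.mean())
print("variance of z:", z.var(), "vs inverse information:", lam0**2)
```

At finite n the MLE here is slightly biased upward, so the mean of z is small but not exactly zero; the gap shrinks as n grows.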
Fisher information
I(θ) = -E[∂² log f(X; θ)/∂θ²] (per observation)
Var(θ̂) ≥ 1 / (n I(θ)) (Cramér-Rao bound, for unbiased θ̂)
Information measures how 'identifiable' θ is from the data. Larger Fisher information → tighter possible estimates. The Hessian of the negative log-likelihood at the MLE is the observed information; divided by n, it is a consistent estimator of I(θ₀). Its inverse is the standard plug-in estimate of the covariance of θ̂. The sandwich estimator, which combines this Hessian with the outer product of the scores, is the misspecification-robust alternative.
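A minimal sketch of turning the Hessian into a standard error, assuming a Poisson model (chosen because the analytic answer √(λ̂/n) is available for comparison). The second derivative is approximated by a central finite difference; the step size h and the helper name neg_loglik are illustrative assumptions.

```python
import numpy as np

def neg_loglik(lam, x):
    # Poisson negative log-likelihood, dropping log(x!) (constant in lam).
    return x.size * lam - x.sum() * np.log(lam)

rng = np.random.default_rng(4)
x = rng.poisson(lam=3.0, size=500)   # simulated counts, hypothetical true rate 3
lam_hat = x.mean()                   # closed-form Poisson MLE

# Observed information: second derivative of the negative log-likelihood at the MLE.
h = 1e-4
obs_info = (neg_loglik(lam_hat + h, x) - 2.0 * neg_loglik(lam_hat, x)
            + neg_loglik(lam_hat - h, x)) / h**2

se_numeric = 1.0 / np.sqrt(obs_info)      # inverse observed information
se_analytic = np.sqrt(lam_hat / x.size)   # 1 / sqrt(n * I(lam_hat)), with I(lam) = 1/lam
print(se_numeric, se_analytic)
```

In multi-parameter problems the same idea applies with a Hessian matrix and its matrix inverse.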
MLE for a normal sample
Take X₁, ..., Xₙ ~ N(μ, σ²). The log-likelihood is ℓ(μ, σ²) = -(n/2) log(2π) - (n/2) log σ² - (1/(2σ²)) Σ (xᵢ - μ)². Setting derivatives to zero:
μ̂ = X̄
σ̂² = (1/n) Σ (xᵢ - X̄)² (biased; divide by n-1 for unbiased estimator)
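For reference, the two first-order conditions behind these estimators, written out from the log-likelihood above (σ² is treated as the parameter, not σ):

```latex
\begin{aligned}
\frac{\partial \ell}{\partial \mu}
  &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0
  &&\Rightarrow\quad \hat\mu = \bar X, \\
\frac{\partial \ell}{\partial \sigma^2}
  &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0
  &&\Rightarrow\quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar X)^2, \\
\mathrm{E}\big[\hat\sigma^2\big]
  &= \frac{n-1}{n}\,\sigma^2 .
\end{aligned}
```

The last line is where the bias comes from: substituting X̄ for μ uses up one degree of freedom.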
MLE vs unbiased — pick your poison
MLE σ̂² uses divisor n, the unbiased estimator s² uses n-1. For normal data the MLE is biased downward, with E[σ̂²] = ((n-1)/n) σ², but has smaller MSE; the unbiased estimator has zero bias and larger variance. For inference, n-1 is the convention; for prediction, the MLE form often wins.
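The comparison can be checked with a quick Monte Carlo sketch. The true variance, sample size, replication count, and the helper name bias_and_mse are all arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, reps = 4.0, 10, 200_000   # hypothetical true variance, sample size, replications

x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
var_mle = x.var(axis=1, ddof=0)      # divisor n   (MLE)
var_unb = x.var(axis=1, ddof=1)      # divisor n-1 (unbiased)

def bias_and_mse(est, truth):
    """Monte Carlo bias and mean squared error of an array of estimates."""
    return est.mean() - truth, np.mean((est - truth) ** 2)

bias_mle, mse_mle = bias_and_mse(var_mle, sigma2)
bias_unb, mse_unb = bias_and_mse(var_unb, sigma2)
print("MLE:      bias", bias_mle, "MSE", mse_mle)
print("unbiased: bias", bias_unb, "MSE", mse_unb)
```

With a small n such as 10 the gap in MSE is visible; it closes as n grows because the two divisors become nearly equal.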
Bias-variance trade-off
MSE(θ̂) = Var(θ̂) + Bias(θ̂)²
Shrinkage estimators, ridge regression, and Bayesian posterior means all introduce a small bias to gain a large reduction in variance, and so can lower MSE overall. The James-Stein estimator famously shows that, for the mean of a multivariate normal in dimension ≥ 3, the sample mean is inadmissible under total squared-error loss: a shrunk-toward-zero alternative has strictly lower total MSE for every value of the true mean.
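A simulation sketch of the James-Stein effect, under simplifying assumptions: a single observation X ~ N(θ, I) in p dimensions (standing in for a sample mean with unit variance per coordinate), and the basic shrinkage factor 1 - (p - 2)/‖X‖². The dimension, the true mean vector, and the replication count are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(3)
p, reps = 10, 100_000
theta = np.full(p, 0.5)                       # hypothetical true mean vector

# One p-dimensional observation per replication, identity covariance.
x = rng.normal(loc=theta, scale=1.0, size=(reps, p))

# James-Stein: shrink each observation toward the origin.
shrink = 1.0 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)
js = shrink * x

# Total squared-error risk, averaged over replications.
risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print("risk of X itself:", risk_mle)
print("risk of James-Stein:", risk_js)
```

The gain is largest when the true mean is near the shrinkage target and fades as it moves away, but the total risk stays below that of the unshrunk estimator.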
GMM — generalisation
Generalised Method of Moments (Hansen, 1982) generalises MoM to over-identified problems with more moment conditions than parameters. Minimises a weighted quadratic in the sample moments. Encompasses MLE, IV, 2SLS as special cases. The lingua franca of structural econometrics.
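In symbols, one common way to write the weighted quadratic is the following. The notation m for the moment function and W for the weighting matrix is an assumption here, not taken from the notes; the moment conditions are E[m(X, θ₀)] = 0 and W is positive definite.

```latex
\bar m_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} m(x_i, \theta),
\qquad
\hat\theta_{\mathrm{GMM}} = \arg\min_{\theta}\;
  \bar m_n(\theta)^{\top}\, W \,\bar m_n(\theta)
```

With exactly as many moment conditions as parameters the minimum is zero and the choice of W is irrelevant, which recovers plain MoM.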
Exercise
You observe n i.i.d. returns assumed to be N(μ, σ²) with σ known. (1) Derive the MLE of μ. (2) Compute its Fisher information. (3) Compare to the Cramér-Rao bound. (4) Comment on what the bound says about the smallest possible standard error of any unbiased estimator of μ.