Module 06 of 12 · 60 min read · Intermediate

Estimation — MLE, MoM, and properties

Estimators as random variables. Method of moments, maximum likelihood. Bias, consistency, efficiency, the Cramér-Rao bound.

An estimator is a function of data that approximates an unknown parameter. Because the data are random, so is the estimator. Its quality is judged on bias (does its expectation hit the truth?), variance (how much does it jitter sample-to-sample?), and consistency (does it converge to the truth as n grows?). MLE (maximum likelihood estimation) is the dominant approach because, under regularity conditions, it is consistent and asymptotically efficient.

Method of moments (MoM)

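Equate sample moments to population moments and solve for parameters. For a normal: set μ = X̄ and σ² = (1/n) Σ (xᵢ - X̄)², the second central sample moment. Easy and often used as a starting point. Less efficient than MLE in general, but it relies only on the chosen moment conditions rather than the full density, which can make it more robust to misspecification.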
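
As an illustration beyond the normal case, here is a minimal sketch of method of moments for a gamma distribution, where E[X] = kθ and Var(X) = kθ². The choice of the gamma family, the function name `gamma_method_of_moments`, and the simulated data are assumptions made for this example only.

```python
import random
import statistics

def gamma_method_of_moments(sample):
    """Return (shape, scale) estimates by matching the first two moments."""
    mean = statistics.fmean(sample)
    # Divisor-n variance: the second central sample moment.
    var = statistics.pvariance(sample, mu=mean)
    shape = mean ** 2 / var   # from E[X] = k*theta and Var(X) = k*theta^2
    scale = var / mean
    return shape, scale

# Simulated data with shape 2 and scale 3 (random.gammavariate takes shape, scale).
rng = random.Random(0)
data = [rng.gammavariate(2.0, 3.0) for _ in range(20_000)]
shape_hat, scale_hat = gamma_method_of_moments(data)
print(shape_hat, scale_hat)
```

The estimates should land close to the simulated values, and they could serve as starting values for a likelihood-based fit.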

Maximum likelihood

Given data x = (x₁, ..., xₙ) drawn i.i.d. from f(x; θ), the likelihood is L(θ) = Π f(xᵢ; θ). The log-likelihood is ℓ(θ) = Σ log f(xᵢ; θ). The MLE θ̂ maximises ℓ(θ).

math
θ̂ = argmax_θ ℓ(θ)
∂ℓ/∂θ |_{θ̂} = 0 (score equation, when smooth)
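
To make the score equation concrete, the sketch below solves it numerically for an exponential model with rate λ, where ℓ(λ) = n log λ - λ Σ xᵢ and the closed-form MLE is 1/X̄. The exponential model, the bisection solver, and the function names are assumptions chosen for illustration; in practice a general-purpose optimiser would usually be used instead.

```python
import random

def exp_score(rate, xs):
    """Derivative of the exponential log-likelihood n*log(rate) - rate*sum(xs)."""
    return len(xs) / rate - sum(xs)

def exp_mle_bisection(xs, iterations=200):
    """Solve the score equation by bisection (the score is decreasing in rate)."""
    lo, hi = 1e-12, 1.0
    while exp_score(hi, xs) > 0:      # widen the bracket until the score turns negative
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if exp_score(mid, xs) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
data = [rng.expovariate(1.5) for _ in range(5_000)]
numeric = exp_mle_bisection(data)
closed_form = len(data) / sum(data)   # 1 / sample mean
print(numeric, closed_form)
```

The numerical root and the closed-form value should agree to many decimal places, which is a useful sanity check before applying the same approach to models without a closed form.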

Properties of MLE under regularity

  1. Consistency: θ̂ →p θ₀ (the true parameter).
  2. Asymptotic normality: √n (θ̂ - θ₀) →d N(0, I(θ₀)⁻¹), where I is the per-observation Fisher information matrix.
  3. Asymptotic efficiency: the MLE attains the Cramér-Rao lower bound asymptotically. Its limiting variance, I(θ₀)⁻¹/n, is the smallest an unbiased estimator can achieve.
  4. Invariance: if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g.
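
Consistency and asymptotic normality can be checked by simulation. The sketch below assumes an exponential model with rate λ₀ = 2, for which the MLE is 1/X̄ and the per-observation Fisher information is 1/λ², so the scaled error √n (λ̂ - λ₀) should have variance near λ₀² = 4. The model, sample sizes, and function name are illustrative assumptions.

```python
import math
import random
import statistics

def simulate_scaled_errors(true_rate, n, reps, seed=0):
    """Return sqrt(n) * (rate_hat - true_rate) for `reps` simulated samples."""
    rng = random.Random(seed)
    scaled = []
    for _ in range(reps):
        total = sum(rng.expovariate(true_rate) for _ in range(n))
        rate_hat = n / total          # MLE of the exponential rate: 1 / sample mean
        scaled.append(math.sqrt(n) * (rate_hat - true_rate))
    return scaled

true_rate = 2.0
errors = simulate_scaled_errors(true_rate, n=400, reps=2000)
print("mean of scaled errors:", statistics.fmean(errors))         # theory: close to 0
print("variance of scaled errors:", statistics.variance(errors))  # theory: close to 4
```

By invariance, the same simulation also covers the MLE of the mean 1/λ, which is simply X̄.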

Fisher information

math
I(θ) = -E[∂² log f(X; θ)/∂θ²] (per observation)
Var(θ̂) ≥ 1 / (n I(θ)) (Cramér-Rao bound, for unbiased θ̂)

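Information measures how 'identifiable' θ is from the data. Larger Fisher information → tighter possible estimates. The Hessian of the negative log-likelihood at the MLE (the observed information) estimates the sample information n I(θ); its inverse is the standard estimate of the asymptotic covariance of θ̂ when the model is correctly specified. The robust 'sandwich' covariance is a different construction: it places the outer product of the scores between two inverse Hessians and is used when the model may be misspecified.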
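
A small numerical sketch of the link between curvature and precision, assuming a Bernoulli model with 30 successes in 100 trials (made-up numbers for illustration). It approximates the second derivative of the negative log-likelihood at the MLE by finite differences and compares it with the analytic value n / (p̂(1 - p̂)). The function names and the step size `h` are assumptions.

```python
import math

def bernoulli_negloglik(p, successes, n):
    """Negative log-likelihood of n Bernoulli trials with the given success count."""
    return -(successes * math.log(p) + (n - successes) * math.log(1.0 - p))

def observed_information(p_hat, successes, n, h=1e-4):
    """Central finite-difference second derivative of the negative log-likelihood."""
    f = lambda p: bernoulli_negloglik(p, successes, n)
    return (f(p_hat + h) - 2.0 * f(p_hat) + f(p_hat - h)) / h ** 2

successes, n = 30, 100
p_hat = successes / n                          # MLE of p
info_numeric = observed_information(p_hat, successes, n)
info_analytic = n / (p_hat * (1.0 - p_hat))    # n * I(p), with I(p) = 1 / (p (1 - p))
std_error = math.sqrt(1.0 / info_numeric)      # inverse curvature gives the variance
print(info_numeric, info_analytic, std_error)
```

The resulting standard error matches the familiar formula √(p̂(1 - p̂)/n), provided p̂ lies strictly between 0 and 1.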

MLE for a normal sample

Take X₁, ..., Xₙ ~ N(μ, σ²). The log-likelihood is ℓ(μ, σ²) = -(n/2) log(2π) - (n/2) log σ² - (1/(2σ²)) Σ (xᵢ - μ)². Setting derivatives to zero:

math
μ̂ = X̄
σ̂² = (1/n) Σ (xᵢ - X̄)² (biased; divide by n-1 for unbiased estimator)
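
For reference, the two score equations behind these results can be written out as follows. This is a routine derivation from the log-likelihood stated above, with the last line recording the bias of σ̂².

```latex
\begin{aligned}
\frac{\partial \ell}{\partial \mu}
  &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
  &&\Rightarrow\quad \hat\mu = \bar X,\\
\frac{\partial \ell}{\partial \sigma^2}
  &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2 = 0
  &&\Rightarrow\quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar X)^2,\\
\mathbb{E}\big[\hat\sigma^2\big]
  &= \frac{n-1}{n}\,\sigma^2 .
\end{aligned}
```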

MLE vs unbiased — pick your poison

The MLE σ̂² uses divisor n; the unbiased estimator s² uses n-1. Under normality the MLE is biased downward by σ²/n but has smaller MSE, while s² has zero bias and larger variance. For inference, n-1 is the convention; for prediction, the MLE form often wins.
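
The trade-off between the two divisors can be seen in a short simulation. The sketch below assumes normal data with σ² = 4 and n = 10 (values chosen only for illustration), and reports the empirical bias and MSE of both estimators. Under these assumptions theory gives a bias of -0.4 for the divisor-n estimator, and MSEs of 3.04 for divisor n versus about 3.56 for divisor n-1.

```python
import random

def variance_estimator_summary(sigma2, n, reps, seed=0):
    """Empirical bias and MSE of the divisor-n and divisor-(n-1) variance estimators."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    # For each estimator: [sum of errors, sum of squared errors]
    sums = {"mle": [0.0, 0.0], "unbiased": [0.0, 0.0]}
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        ss = sum((x - mean) ** 2 for x in xs)
        for name, divisor in (("mle", n), ("unbiased", n - 1)):
            err = ss / divisor - sigma2
            sums[name][0] += err
            sums[name][1] += err * err
    return {name: {"bias": s[0] / reps, "mse": s[1] / reps} for name, s in sums.items()}

summary = variance_estimator_summary(sigma2=4.0, n=10, reps=20_000)
for name, stats in summary.items():
    print(name, stats)
```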

Bias-variance trade-off

math
MSE(θ̂) = Var(θ̂) + Bias(θ̂)²

Shrinkage estimators, ridge regression, and Bayesian posteriors all accept a small bias in exchange for a reduction in variance, which can lower MSE overall. The James-Stein estimator famously shows that, for the mean of a multivariate normal under total squared-error loss, the sample mean is inadmissible in dimension ≥ 3: a shrunk-toward-zero alternative has strictly lower MSE for every value of the true mean.
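
A minimal simulation sketch of the James-Stein effect, assuming a single draw X ~ N(θ, I) in dimension 10 with known unit variance and loss summed over coordinates. It uses the plain (not positive-part) James-Stein rule (1 - (p-2)/‖X‖²) X. The true mean vector and the function names are illustrative assumptions.

```python
import random

def james_stein(x):
    """Plain James-Stein shrinkage toward the origin for one draw x ~ N(theta, I)."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    factor = 1.0 - (p - 2) / norm2
    return [factor * v for v in x]

def compare_risk(theta, reps, seed=0):
    """Average total squared error of the raw observation and of James-Stein."""
    rng = random.Random(seed)
    loss_raw = loss_js = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        js = james_stein(x)
        loss_raw += sum((a - t) ** 2 for a, t in zip(x, theta))
        loss_js += sum((a - t) ** 2 for a, t in zip(js, theta))
    return loss_raw / reps, loss_js / reps

theta = [1.0] * 10            # hypothetical true mean vector
risk_raw, risk_js = compare_risk(theta, reps=5000)
print(risk_raw, risk_js)      # raw risk is near the dimension; James-Stein is lower
```

The gain shrinks as the true mean moves far from the origin, but under these assumptions it never becomes a loss.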

GMM — generalisation

Generalised Method of Moments (Hansen, 1982) generalises MoM to over-identified problems with more moment conditions than parameters. Minimises a weighted quadratic in the sample moments. Encompasses MLE, IV, 2SLS as special cases. The lingua franca of structural econometrics.
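
In symbols, the GMM estimator can be sketched as below. The notation is an assumption introduced here: m(x; θ) denotes the vector of moment functions and W a positive-definite weighting matrix.

```latex
\begin{aligned}
\mathbb{E}\,[\,m(X;\theta_0)\,] &= 0
  && \text{(population moment conditions)}\\
\bar m_n(\theta) &= \frac{1}{n}\sum_{i=1}^{n} m(x_i;\theta)
  && \text{(sample analogue)}\\
\hat\theta_{\mathrm{GMM}} &= \arg\min_{\theta}\; \bar m_n(\theta)^{\top}\, W\, \bar m_n(\theta)
\end{aligned}
```

With as many moment conditions as parameters, the minimum is zero and the estimator reduces to ordinary method of moments whatever W is. Taking m to be the score, ∂ log f(x; θ)/∂θ, recovers the first-order condition of the MLE.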

Exercise

You observe n i.i.d. returns assumed to be N(μ, σ²) with σ known. (1) Derive the MLE of μ. (2) Compute its Fisher information. (3) Compare to the Cramér-Rao bound. (4) Comment on what the bound says about the smallest possible standard error of any unbiased estimator of μ.

LeadAfrik Public Economics Hub