Matrix calculus for optimisation — Linear Algebra Module 11

Optimisation problems live or die on whether you can compute gradients and Hessians cleanly. Matrix calculus is the bookkeeping system that lets you differentiate scalar functions of vectors and matrices without dropping into index notation for every step. A handful of identities, applied carefully, generate most of the gradients you'll need in portfolio optimisation, ML, and econometrics.

Conventions

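We write gradients as column vectors (denominator layout for scalar functions): if f: Rⁿ → R, then ∇f is a column vector in Rⁿ. For f: Rⁿ → Rᵐ, the Jacobian ∂f/∂x is the m×n matrix with entries ∂fᵢ/∂xⱼ; strict denominator layout would use its n×m transpose, so check which one a given text means. The Hessian H = ∂²f/∂x∂xᵀ is symmetric whenever f has continuous second partial derivatives. Other texts use numerator layout throughout; pick one convention and stick with it.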

Core identities

∂(aᵀx)/∂x = a
∂(xᵀa)/∂x = a
∂(xᵀx)/∂x = 2x
∂(xᵀAx)/∂x = (A + Aᵀ)x      (= 2Ax if A is symmetric)
∂²(xᵀAx)/∂x∂xᵀ = A + Aᵀ       (Hessian; = 2A if A is symmetric)
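A quick way to build trust in these identities is to compare them with finite differences. The sketch below checks the quadratic-form gradient for a deliberately non-symmetric A; the helper `numerical_gradient`, the random seed, and the dimension are illustrative choices, not part of the module.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation to the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # deliberately non-symmetric
x = rng.standard_normal(n)

quadratic_form = lambda v: v @ A @ v

analytic = (A + A.T) @ x                      # identity from the table above
numeric = numerical_gradient(quadratic_form, x)
max_abs_error = np.max(np.abs(analytic - numeric))
print("max |analytic - numeric| =", max_abs_error)
```

Using 2Ax here instead of (A + Aᵀ)x would generally fail the check, because A is not symmetric.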

The quadratic form identity is the workhorse

Almost every portfolio-optimisation first-order condition reduces to '∂(wᵀΣw)/∂w + (linear term) = 0', which collapses to '2Σw + (linear term) = 0' since Σ is symmetric. Memorise this.

Minimum-variance portfolio derivation

Problem: min (1/2) wᵀΣw subject to 1ᵀw = 1 (weights sum to 1). Lagrangian: L = (1/2) wᵀΣw - λ(1ᵀw - 1). First-order conditions:

∂L/∂w = Σw - λ·1 = 0  ⟹  w = λ Σ⁻¹ 1
∂L/∂λ = -(1ᵀw - 1) = 0  ⟹  1ᵀw = λ 1ᵀ Σ⁻¹ 1 = 1  ⟹  λ = 1 / (1ᵀ Σ⁻¹ 1)

w_MV = Σ⁻¹ 1 / (1ᵀ Σ⁻¹ 1)

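Three lines of matrix calculus deliver the global minimum-variance portfolio in closed form, provided Σ is positive definite (so Σ⁻¹ exists and the objective is strictly convex, making the stationary point the global minimum). This is why linear algebra is the right language for portfolio theory.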
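The closed form translates almost directly into code. The following is a minimal sketch, assuming a small positive-definite covariance matrix with made-up numbers; the function name `min_variance_weights` is hypothetical. It uses a linear solve rather than forming Σ⁻¹ explicitly, which is the usual numerical habit.

```python
import numpy as np

def min_variance_weights(cov):
    """Return Σ⁻¹1 / (1ᵀΣ⁻¹1) using a linear solve instead of an explicit inverse."""
    ones = np.ones(cov.shape[0])
    z = np.linalg.solve(cov, ones)   # z = Σ⁻¹ 1
    return z / (ones @ z)

# Hypothetical 3-asset covariance matrix (illustrative numbers only).
cov = np.array([[0.040, 0.006, 0.004],
                [0.006, 0.090, 0.010],
                [0.004, 0.010, 0.160]])

w_mv = min_variance_weights(cov)
var_mv = w_mv @ cov @ w_mv

# Compare against equal weights, which also satisfy 1ᵀw = 1.
w_equal = np.full(3, 1 / 3)
var_equal = w_equal @ cov @ w_equal

# First-order condition: Σw should be a constant vector (equal to λ·1).
foc_vector = cov @ w_mv
print("weights:", w_mv, "variance:", var_mv, "equal-weight variance:", var_equal)
```

At the solution every entry of Σw equals λ, and wᵀΣw = λ as well, which gives a cheap sanity check on any implementation.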

OLS gradient

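L(β) = (y - Xβ)ᵀ(y - Xβ) = yᵀy - 2βᵀXᵀy + βᵀXᵀXβ. Differentiating with the linear and quadratic-form identities and setting the gradient to zero (XᵀX is invertible when X has full column rank):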

∂L/∂β = -2Xᵀy + 2XᵀXβ = 0
β̂ = (XᵀX)⁻¹ Xᵀy
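As a rough check of the normal equations, the sketch below fits simulated data two ways: by solving XᵀXβ = Xᵀy directly and with NumPy's least-squares routine. The data, seed, and `beta_true` are invented for illustration, and X is assumed to have full column rank.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_feat = 50, 3
X = rng.standard_normal((n_obs, n_feat))
beta_true = np.array([1.5, -2.0, 0.5])            # hypothetical coefficients
y = X @ beta_true + 0.1 * rng.standard_normal(n_obs)

# Normal equations: solve (XᵀX) β = Xᵀy rather than inverting XᵀX.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient -2Xᵀy + 2XᵀXβ should vanish at the solution.
gradient_at_solution = -2 * X.T @ y + 2 * X.T @ X @ beta_normal
print(beta_normal, beta_lstsq, gradient_at_solution)
```

For ill-conditioned X, a QR- or SVD-based solver such as `np.linalg.lstsq` is generally the safer choice than the normal equations.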

Trace tricks

Three identities used constantly:

tr(AB) = tr(BA), tr(ABC) = tr(BCA) = tr(CAB) — cyclic permutation.
∂ tr(AX)/∂X = Aᵀ
∂ log det(X)/∂X = (X⁻¹)ᵀ for det(X) > 0 (the determinant trick, fundamental in Gaussian MLE)
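The three identities above can be spot-checked numerically. This sketch perturbs each entry of X in turn; the helper `numerical_matrix_gradient` and the test matrices are illustrative assumptions. X is built as a small perturbation of 3·I so that it is non-symmetric with a positive determinant.

```python
import numpy as np

def numerical_matrix_gradient(f, X, eps=1e-6):
    """Central finite differences over every entry of X; result has X's shape."""
    grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
X = n * np.eye(n) + 0.3 * rng.standard_normal((n, n))   # non-symmetric, det > 0

# Cyclic permutation: tr(ABX) = tr(BXA).
cyclic_gap = abs(np.trace(A @ B @ X) - np.trace(B @ X @ A))

# d tr(AX)/dX = Aᵀ
trace_grad = numerical_matrix_gradient(lambda M: np.trace(A @ M), X)
trace_error = np.max(np.abs(trace_grad - A.T))

# d log det(X)/dX = (X⁻¹)ᵀ; slogdet returns (sign, log|det|), equal to log det when det > 0.
log_det = lambda M: np.linalg.slogdet(M)[1]
logdet_grad = numerical_matrix_gradient(log_det, X)
logdet_error = np.max(np.abs(logdet_grad - np.linalg.inv(X).T))
print(cyclic_gap, trace_error, logdet_error)
```

Because X is not symmetric here, dropping the transpose in either derivative would show up as a visible error.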

Gaussian MLE — why the determinant trick matters

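Up to an additive constant, the per-observation log-likelihood of a mean-centred MVN sample is -½ log det(Σ) - ½ tr(SΣ⁻¹), where S is the sample covariance with 1/n normalisation. The derivative of the first term uses ∂ log det(Σ)/∂Σ = (Σ⁻¹)ᵀ = Σ⁻¹ (Σ symmetric); the second uses ∂ tr(SΣ⁻¹)/∂Σ = -Σ⁻¹SΣ⁻¹. Setting the sum to zero gives -½Σ⁻¹ + ½Σ⁻¹SΣ⁻¹ = 0, and multiplying by Σ on the left and on the right yields the MLE Σ̂ = S in one line.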
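The stationarity condition can be verified on simulated data. The sketch below is an illustration under stated assumptions: the function names `avg_loglik` and `score`, the mixing matrix, and the sample size are all invented, and the constant term of the log-likelihood is dropped.

```python
import numpy as np

def avg_loglik(Sigma, S):
    """Per-observation Gaussian log-likelihood in Σ, constant dropped:
    -½ log det(Σ) - ½ tr(S Σ⁻¹)."""
    _, logdet = np.linalg.slogdet(Sigma)
    # tr(S Σ⁻¹) = tr(Σ⁻¹ S) by cyclic permutation.
    return -0.5 * logdet - 0.5 * np.trace(np.linalg.solve(Sigma, S))

def score(Sigma, S):
    """Derivative of avg_loglik with respect to Σ: -½ Σ⁻¹ + ½ Σ⁻¹ S Σ⁻¹."""
    Sigma_inv = np.linalg.inv(Sigma)
    return -0.5 * Sigma_inv + 0.5 * Sigma_inv @ S @ Sigma_inv

rng = np.random.default_rng(3)
mixing = np.array([[1.0, 0.2, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])              # hypothetical mixing matrix
data = rng.standard_normal((200, 3)) @ mixing

centered = data - data.mean(axis=0)
S = centered.T @ centered / len(data)             # 1/n normalisation

score_at_S = score(S, S)                          # should be (numerically) zero
loglik_at_S = avg_loglik(S, S)
loglik_perturbed = avg_loglik(S + 0.1 * np.eye(3), S)
print(np.max(np.abs(score_at_S)), loglik_at_S, loglik_perturbed)
```

Note that the matrix S used here matches `np.cov(data, rowvar=False, bias=True)`, not the default 1/(n-1) estimator.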

Chain rule for matrix derivatives

If h(x) = g(f(x)) where f: Rⁿ → Rᵐ and g: Rᵐ → R: ∇h(x) = (∂f/∂x)ᵀ ∇g(f(x)). For deep-learning aficionados: this is just backprop. For everyone else: it's how you differentiate things like (Xw - y)ᵀ M (Xw - y).
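To make the chain rule concrete, take f(w) = Xw - y, whose Jacobian is X, and g(r) = rᵀMr, whose gradient is (M + Mᵀ)r. The rule then gives ∇h(w) = Xᵀ(M + Mᵀ)(Xw - y). The sketch below checks that against finite differences; the random matrices and the helper `numerical_gradient` are illustrative assumptions.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation to the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(4)
m, n = 6, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
M = rng.standard_normal((m, m))    # not assumed symmetric
w = rng.standard_normal(n)

def h(w_vec):
    """h(w) = (Xw - y)ᵀ M (Xw - y)."""
    r = X @ w_vec - y
    return r @ M @ r

jacobian_f = X                     # m×n Jacobian of f(w) = Xw - y
residual = X @ w - y
grad_g = (M + M.T) @ residual      # gradient of g(r) = rᵀMr at r = f(w)
analytic = jacobian_f.T @ grad_g   # chain rule: (∂f/∂w)ᵀ ∇g(f(w))

numeric = numerical_gradient(h, w)
max_abs_error = np.max(np.abs(analytic - numeric))
print("max |analytic - numeric| =", max_abs_error)
```

With M = I this reduces to 2Xᵀ(Xw - y), the OLS gradient from earlier.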

Exercise

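Derive the maximum-Sharpe (tangency) portfolio with a riskless rate r_f. Problem: maximise (μᵀw - r_f) / √(wᵀΣw) subject to 1ᵀw = 1. On the constraint set the numerator equals (μ - r_f·1)ᵀw, and the ratio (μ - r_f·1)ᵀw / √(wᵀΣw) is invariant to positive rescaling of w, so solve the unconstrained problem max (μ - r_f·1)ᵀw - λ wᵀΣw / 2 and then rescale so that 1ᵀw = 1. Find w∗.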