Optimisation problems live or die on whether you can compute gradients and Hessians cleanly. Matrix calculus is the bookkeeping system that lets you differentiate scalar functions of vectors and matrices without dropping into index notation for every step. A handful of identities, applied carefully, generate most of the gradients you'll need in portfolio optimisation, ML, and econometrics.
Conventions
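We use denominator layout for gradients: if f: Rⁿ → R, then ∇f (the gradient) is a column vector in Rⁿ. For f: Rⁿ → Rᵐ, the Jacobian ∂f/∂x is the m×n matrix whose (i, j) entry is ∂fᵢ/∂xⱼ; this is the transpose of the strict denominator-layout derivative, which would be n×m. The Hessian H = ∂²f/∂x∂xᵀ is n×n and symmetric whenever f has continuous second partial derivatives. Other texts use numerator layout, in which the gradient is a row vector. Pick one convention and stick with it.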
Core identities
∂(aᵀx)/∂x = a
∂(xᵀa)/∂x = a
∂(xᵀx)/∂x = 2x
∂(xᵀAx)/∂x = (A + Aᵀ)x (= 2Ax if A is symmetric)
∂²(xᵀAx)/∂x∂xᵀ = A + Aᵀ (Hessian; = 2A if A is symmetric)
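The identities above can be spot-checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper `numerical_gradient` and the random test data are illustrative choices, not part of the original text. It compares each closed-form gradient with a central-difference approximation.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation to the gradient of a scalar f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n = 4
a = rng.normal(size=n)
A = rng.normal(size=(n, n))  # deliberately NOT symmetric
x = rng.normal(size=n)

# d(a^T x)/dx = a
linear_ok = np.allclose(numerical_gradient(lambda v: a @ v, x), a, atol=1e-6)
# d(x^T x)/dx = 2x
norm_ok = np.allclose(numerical_gradient(lambda v: v @ v, x), 2 * x, atol=1e-6)
# d(x^T A x)/dx = (A + A^T) x, which differs from 2Ax when A is not symmetric
quad_ok = np.allclose(
    numerical_gradient(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-6
)

print(linear_ok, norm_ok, quad_ok)
```

Using a non-symmetric A is deliberate: it shows why the general form (A + Aᵀ)x is the safe one to remember, with 2Ax as the symmetric special case.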
The quadratic form identity is the workhorse
Almost every portfolio-optimisation first-order condition reduces to '∂(wᵀΣw)/∂w + (linear term) = 0', which collapses to '2Σw + (linear term) = 0' since Σ is symmetric. Memorise this.
Minimum-variance portfolio derivation
Problem: min (1/2) wᵀΣw subject to 1ᵀw = 1 (weights sum to 1). Lagrangian: L = (1/2) wᵀΣw - λ(1ᵀw - 1). First-order conditions:
∂L/∂w = Σw - λ·1 = 0 ⟹ w = λ Σ⁻¹ 1
∂L/∂λ = -(1ᵀw - 1) = 0 ⟹ λ 1ᵀ Σ⁻¹ 1 = 1 ⟹ λ = 1 / (1ᵀ Σ⁻¹ 1)
w_MV = Σ⁻¹ 1 / (1ᵀ Σ⁻¹ 1)
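Three lines of matrix calculus deliver the global minimum-variance portfolio in closed form. Provided Σ is positive definite (so that Σ⁻¹ exists and the objective is strictly convex), the first-order conditions identify the unique global minimum. This is why linear algebra is the right language for portfolio theory.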
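As a rough illustration of the closed form above, here is a small NumPy sketch. The function name `min_variance_weights` and the 3-asset covariance matrix are hypothetical, chosen only so the example runs; the matrix is strictly diagonally dominant and therefore positive definite.

```python
import numpy as np

def min_variance_weights(cov):
    """Global minimum-variance weights: Sigma^{-1} 1 / (1^T Sigma^{-1} 1)."""
    ones = np.ones(cov.shape[0])
    # Solve Sigma z = 1 instead of forming the inverse explicitly.
    z = np.linalg.solve(cov, ones)
    return z / (ones @ z)

# Hypothetical covariance matrix (illustrative numbers only).
cov = np.array([
    [0.040, 0.006, 0.004],
    [0.006, 0.090, 0.012],
    [0.004, 0.012, 0.160],
])
ones = np.ones(3)

w_mv = min_variance_weights(cov)
lam = 1.0 / (ones @ np.linalg.solve(cov, ones))  # Lagrange multiplier

budget_residual = w_mv.sum() - 1.0       # should be ~0
foc_residual = cov @ w_mv - lam * ones   # Sigma w - lambda 1, should be ~0
variance_mv = w_mv @ cov @ w_mv

# Any other fully invested portfolio should have at least this variance.
w_equal = ones / 3
variance_equal = w_equal @ cov @ w_equal

print(w_mv, budget_residual, foc_residual, variance_mv, variance_equal)
```

Solving the linear system rather than inverting Σ is a common numerical habit, not something the derivation requires. A useful side check is that the minimised variance wᵀΣw equals λ, which follows from substituting w = λ Σ⁻¹ 1.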
OLS gradient
L(β) = (y - Xβ)ᵀ(y - Xβ) = yᵀy - 2βᵀXᵀy + βᵀXᵀXβ. Differentiating:
∂L/∂β = -2Xᵀy + 2XᵀXβ = 0
β̂ = (XᵀX)⁻¹ Xᵀy
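The normal equations can be checked against a library solver. This is a sketch under stated assumptions: the data are simulated, `beta_true` is a hypothetical coefficient vector, and X is assumed to have full column rank so that XᵀX is invertible.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_feat = 200, 3
X = rng.normal(size=(n_obs, n_feat))
beta_true = np.array([1.5, -2.0, 0.5])  # hypothetical coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n_obs)

def ols_gradient(beta):
    """Gradient of (y - X beta)^T (y - X beta): -2 X^T y + 2 X^T X beta."""
    return -2 * X.T @ y + 2 * X.T @ X @ beta

# Closed form from the first-order condition, via a linear solve.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference answer from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

grad_norm = np.linalg.norm(ols_gradient(beta_hat))
print(beta_hat, beta_lstsq, grad_norm)
```

The gradient at `beta_hat` should vanish up to rounding. For ill-conditioned X, a least-squares routine such as `np.linalg.lstsq` is usually preferable to forming XᵀX.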
Trace tricks
Three identities used constantly:
- tr(AB) = tr(BA), tr(ABC) = tr(BCA) = tr(CAB) — cyclic permutation.
- ∂ tr(AX)/∂X = Aᵀ
- ∂ log det(X)/∂X = (X⁻¹)ᵀ (the determinant trick, fundamental in Gaussian MLE)
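The two derivative identities in this list can be verified entry by entry with finite differences. The sketch below assumes NumPy; `numerical_matrix_gradient` is a hypothetical helper, and the fixed matrix X is chosen to be non-symmetric with positive determinant so that log det(X) is defined and the transposes are visible.

```python
import numpy as np

def numerical_matrix_gradient(f, X, eps=1e-6):
    """Central-difference gradient of a scalar f with respect to each X[i, j]."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))

# Non-symmetric, strictly diagonally dominant with positive diagonal,
# so det(X) > 0 (illustrative numbers only).
X = np.array([
    [3.0, 0.2, -0.1],
    [0.4, 2.5, 0.3],
    [-0.2, 0.1, 2.0],
])

# d tr(AX)/dX = A^T
trace_grad = numerical_matrix_gradient(lambda M: np.trace(A @ M), X)
trace_ok = np.allclose(trace_grad, A.T, atol=1e-6)

# d log det(X)/dX = (X^{-1})^T
logdet_grad = numerical_matrix_gradient(lambda M: np.log(np.linalg.det(M)), X)
logdet_ok = np.allclose(logdet_grad, np.linalg.inv(X).T, atol=1e-6)

# Cyclic property of the trace: tr(AX) = tr(XA)
cyclic_ok = np.isclose(np.trace(A @ X), np.trace(X @ A))

print(trace_ok, logdet_ok, cyclic_ok)
```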
Gaussian MLE — why the determinant trick matters
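The log-likelihood of an MVN sample of size n contains the term -(n/2) log det(Σ). Per observation, and with the mean set to the sample mean, the Σ-dependent part is -½ log det(Σ) - ½ tr(SΣ⁻¹), where S is the sample covariance with divisor n. The derivative with respect to Σ uses ∂ log det(Σ)/∂Σ = (Σ⁻¹)ᵀ = Σ⁻¹ (Σ symmetric). Combined with ∂ tr(SΣ⁻¹)/∂Σ = -Σ⁻¹SΣ⁻¹ (S and Σ symmetric), the first-order condition is -½Σ⁻¹ + ½Σ⁻¹SΣ⁻¹ = 0. Multiplying on the left and right by Σ gives the MLE Σ̂ = S (sample covariance) in one line.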
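To make the claim concrete, the sketch below evaluates the gradient -½Σ⁻¹ + ½Σ⁻¹SΣ⁻¹ at Σ = S and compares the log-likelihood at S with a few nearby positive-definite candidates. It assumes NumPy, simulated data, and the per-observation log-likelihood with constants dropped; the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, dim = 500, 3
mixing = np.array([
    [1.0, 0.3, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0],
])  # hypothetical mixing matrix to induce correlation
data = rng.normal(size=(n_obs, dim)) @ mixing

# MLE sample covariance: divisor n (bias=True), centred at the sample mean.
S = np.cov(data, rowvar=False, bias=True)

def avg_loglik(sigma):
    """Per-observation Gaussian log-likelihood in Sigma, constants dropped."""
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * logdet - 0.5 * np.trace(S @ np.linalg.inv(sigma))

def avg_loglik_gradient(sigma):
    """-1/2 Sigma^{-1} + 1/2 Sigma^{-1} S Sigma^{-1}, for symmetric sigma."""
    inv = np.linalg.inv(sigma)
    return -0.5 * inv + 0.5 * inv @ S @ inv

grad_at_S = avg_loglik_gradient(S)

# A few perturbed positive-definite candidates should all score lower than S.
candidates = [S + 0.05 * np.eye(dim), 1.1 * S, 0.9 * S]
beats_all = all(avg_loglik(S) > avg_loglik(c) for c in candidates)

print(np.abs(grad_at_S).max(), beats_all)
```

Note that `np.cov` defaults to the divisor n - 1; `bias=True` switches to the divisor n that the MLE uses.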
Chain rule for matrix derivatives
If h(x) = g(f(x)) where f: Rⁿ → Rᵐ and g: Rᵐ → R: ∇h(x) = (∂f/∂x)ᵀ ∇g(f(x)). For deep-learning aficionados: this is just backprop. For everyone else: it's how you differentiate things like (Xw - y)ᵀ M (Xw - y).
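For the example (Xw - y)ᵀ M (Xw - y), take f(w) = Xw - y with Jacobian X and g(r) = rᵀMr with gradient (M + Mᵀ)r, so the chain rule gives ∇h(w) = Xᵀ(M + Mᵀ)(Xw - y). The following sketch, assuming NumPy and random illustrative data, checks that result against finite differences; `numerical_gradient` is a hypothetical helper.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation to the gradient of a scalar f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(4)
m, n = 6, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
M = rng.normal(size=(m, m))  # hypothetical weighting matrix, not symmetric
w = rng.normal(size=n)

def h(w_vec):
    """h(w) = (Xw - y)^T M (Xw - y)."""
    r = X @ w_vec - y
    return r @ M @ r

residual = X @ w - y
jacobian_f = X                  # df/dw for f(w) = Xw - y, shape (m, n)
grad_g = (M + M.T) @ residual   # gradient of g(r) = r^T M r at r = f(w)
grad_chain = jacobian_f.T @ grad_g  # chain rule: J^T grad_g

chain_ok = np.allclose(grad_chain, numerical_gradient(h, w), atol=1e-5)
print(chain_ok)
```

The shapes are the useful sanity check here: the Jacobian is m×n, the inner gradient has length m, and the transpose is what brings the result back to length n.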
Exercise
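Derive the maximum-Sharpe (tangency) portfolio with a riskless rate r_f. Problem: maximise (μᵀw - r_f) / √(wᵀΣw) subject to 1ᵀw = 1. Under the budget constraint, μᵀw - r_f = (μ - r_f·1)ᵀw, and the ratio (μ - r_f·1)ᵀw / √(wᵀΣw) is scale-invariant in w. You can therefore solve the unconstrained problem max (μ - r_f·1)ᵀw - (λ/2) wᵀΣw and rescale the solution so the weights sum to 1, which fixes λ. Find w∗.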