
Matrix calculus for optimisation

Gradient and Hessian of quadratic forms. Derivative of xᵀAx. The exact identities that drive every portfolio optimisation.

Optimisation problems live or die on whether you can compute gradients and Hessians cleanly. Matrix calculus is the bookkeeping system that lets you differentiate scalar functions of vectors and matrices without dropping into index notation for every step. A handful of identities, applied carefully, generate most of the gradients you'll need in portfolio optimisation, ML, and econometrics.

Conventions

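We write gradients as column vectors (denominator layout for scalar functions): if f: Rⁿ → R, then ∇f is a column vector in Rⁿ. For a vector-valued f: Rⁿ → Rᵐ, ∂f/∂x denotes the m×n Jacobian with (i, j) entry ∂fᵢ/∂xⱼ; note that a strict denominator layout would use its n×m transpose. The Hessian H = ∂²f/∂x∂xᵀ is n×n and is symmetric whenever f is twice continuously differentiable. Other texts use numerator layout throughout; pick one convention and stick with it.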

Core identities

∂(aᵀx)/∂x = a
∂(xᵀa)/∂x = a
∂(xᵀx)/∂x = 2x
∂(xᵀAx)/∂x = (A + Aᵀ)x (= 2Ax if A is symmetric)
∂²(xᵀAx)/∂x∂xᵀ = A + Aᵀ (Hessian; = 2A if A is symmetric)
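
The quadratic-form identity is easy to check numerically. The sketch below compares (A + Aᵀ)x with a central-difference gradient for a random, deliberately non-symmetric A. The helper names `quad_form` and `numerical_gradient` are illustrative and not from any library, and NumPy is assumed to be available.

```python
import numpy as np

def quad_form(A, x):
    """Scalar quadratic form f(x) = x^T A x."""
    return x @ A @ x

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))   # deliberately NOT symmetric
x = rng.normal(size=4)

analytic = (A + A.T) @ x      # the identity from the list above
numeric = numerical_gradient(lambda v: quad_form(A, v), x)
max_abs_error = np.max(np.abs(analytic - numeric))

# Only the symmetric part of A matters: 2 * A_sym @ x gives the same gradient.
A_sym = 0.5 * (A + A.T)
symmetric_version = 2 * A_sym @ x
```

Using a non-symmetric A is the point of the check: writing 2Ax without first confirming symmetry is the most common slip with this identity.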

The quadratic form identity is the workhorse

Almost every portfolio-optimisation first-order condition reduces to '∂(wᵀΣw)/∂w + (linear term) = 0', which collapses to '2Σw + (linear term) = 0' since Σ is symmetric. Memorise this.

Minimum-variance portfolio derivation

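Problem: min (1/2) wᵀΣw subject to 1ᵀw = 1, where 1 is the vector of ones (weights sum to 1) and Σ is assumed positive definite, so that Σ⁻¹ exists and the problem is strictly convex. The factor 1/2 cancels the 2 from the quadratic-form identity and does not change the minimiser. Lagrangian: L = (1/2) wᵀΣw - λ(1ᵀw - 1). First-order conditions: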

∂L/∂w = Σw - λ·1 = 0 ⟹ w = λ Σ⁻¹ 1
∂L/∂λ = -(1ᵀw - 1) = 0 ⟹ λ·1ᵀ Σ⁻¹ 1 = 1 ⟹ λ = 1 / (1ᵀ Σ⁻¹ 1)
w_MV = Σ⁻¹ 1 / (1ᵀ Σ⁻¹ 1)

Three lines of matrix calculus deliver the global minimum-variance portfolio in closed form. This is why linear algebra is the right language for portfolio theory.
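
As a minimal sketch of the closed form, the code below computes w_MV for a hypothetical 3-asset covariance matrix. The numbers are illustrative only, and `min_variance_weights` is a name chosen for this example. It solves Σz = 1 instead of forming Σ⁻¹ explicitly, which is usually the safer numerical choice.

```python
import numpy as np

def min_variance_weights(cov):
    """Global minimum-variance weights: Sigma^{-1} 1 / (1^T Sigma^{-1} 1)."""
    ones = np.ones(cov.shape[0])
    z = np.linalg.solve(cov, ones)   # z = Sigma^{-1} 1 without an explicit inverse
    return z / (ones @ z)            # rescale so the weights sum to 1

# Hypothetical covariance matrix (illustrative numbers, positive definite)
cov = np.array([
    [0.040, 0.006, 0.010],
    [0.006, 0.090, 0.012],
    [0.010, 0.012, 0.160],
])

w_mv = min_variance_weights(cov)
var_mv = w_mv @ cov @ w_mv

# First-order condition: Sigma w must be a constant multiple of the ones vector.
foc = cov @ w_mv

# Any other fully invested portfolio should have at least as much variance.
w_equal = np.full(3, 1 / 3)
var_equal = w_equal @ cov @ w_equal
```

The entries of `foc` are all equal to λ, which is also the minimised variance 1 / (1ᵀ Σ⁻¹ 1).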

OLS gradient

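The sum of squared residuals is L(β) = (y - Xβ)ᵀ(y - Xβ) = yᵀy - 2βᵀXᵀy + βᵀXᵀXβ. Differentiating with the linear and quadratic-form identities (XᵀX is symmetric), and assuming X has full column rank so that XᵀX is invertible: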

∂L/∂β = -2Xᵀy + 2XᵀXβ = 0
β̂ = (XᵀX)⁻¹ Xᵀy
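
The normal equations can be checked on simulated data. In this sketch the design matrix, the true coefficients and the noise level are all made up for illustration. It solves XᵀXβ = Xᵀy directly, confirms that the gradient vanishes at the solution, and cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3

# Simulated data: intercept plus two regressors (illustrative values only)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient -2 X^T y + 2 X^T X beta should vanish at beta_hat
grad_at_hat = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

# Cross-check with NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

When X is ill-conditioned, a least-squares routine such as `np.linalg.lstsq` is generally preferable to forming XᵀX, because forming the product squares the condition number.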

Trace tricks

Three identities used constantly:

  • tr(AB) = tr(BA), tr(ABC) = tr(BCA) = tr(CAB) — cyclic permutation.
  • ∂ tr(AX)/∂X = Aᵀ
  • ∂ log det(X)/∂X = (X⁻¹)ᵀ (the determinant trick, fundamental in Gaussian MLE)
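
All three identities can be verified by finite differences. The sketch below uses a hypothetical helper, `numerical_matrix_gradient`, that perturbs one entry of X at a time. X is built as a non-symmetric matrix with a dominant diagonal, so det(X) > 0 and the transpose in the identities actually matters.

```python
import numpy as np

def numerical_matrix_gradient(f, X, eps=1e-6):
    """Central-difference derivative of scalar f with respect to each entry of X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
# Non-symmetric, diagonally dominant with positive diagonal, hence det(X) > 0
X = 4.0 * np.eye(3) + 0.2 * rng.uniform(-1.0, 1.0, size=(3, 3))

# Cyclic permutation: tr(AB) = tr(BA)
cyclic_gap = abs(np.trace(A @ B) - np.trace(B @ A))

# d tr(AX) / dX = A^T
grad_trace_fd = numerical_matrix_gradient(lambda M: np.trace(A @ M), X)
grad_trace_exact = A.T

# d log det(X) / dX = (X^{-1})^T
grad_logdet_fd = numerical_matrix_gradient(lambda M: np.log(np.linalg.det(M)), X)
grad_logdet_exact = np.linalg.inv(X).T
```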

Gaussian MLE — why the determinant trick matters

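The log-likelihood of an MVN sample of n observations contains a -(n/2) log det(Σ) term, that is, -½ log det(Σ) per observation. Its derivative with respect to Σ uses ∂ log det(Σ)/∂Σ = (Σ⁻¹)ᵀ = Σ⁻¹ (Σ symmetric). Combined with ∂ tr(SΣ⁻¹)/∂Σ = -Σ⁻¹SΣ⁻¹ (for symmetric S and Σ), the first-order condition Σ⁻¹ - Σ⁻¹SΣ⁻¹ = 0 gives the MLE Σ̂ = S in one line, where S is the sample covariance computed with divisor n rather than n - 1.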
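
A small simulation illustrates the result. The sketch assumes the mean has already been replaced by the sample mean, so the log-likelihood in Σ is -(n/2)[log det(Σ) + tr(SΣ⁻¹)] up to a constant. The function names and the "true" covariance are made up for illustration, and the derivative treats the entries of Σ as unconstrained.

```python
import numpy as np

def gaussian_loglik_cov(Sigma, S, n):
    """Log-likelihood in Sigma, up to an additive constant:
    -(n/2) * [log det(Sigma) + tr(S Sigma^{-1})]."""
    _, logdet = np.linalg.slogdet(Sigma)
    # tr(S Sigma^{-1}) = tr(Sigma^{-1} S), computed via a linear solve
    return -0.5 * n * (logdet + np.trace(np.linalg.solve(Sigma, S)))

def score_cov(Sigma, S, n):
    """Derivative of the log-likelihood with respect to Sigma:
    -(n/2) * (Sigma^{-1} - Sigma^{-1} S Sigma^{-1})."""
    Sinv = np.linalg.inv(Sigma)
    return -0.5 * n * (Sinv - Sinv @ S @ Sinv)

rng = np.random.default_rng(3)
n = 500
true_cov = np.array([[1.0, 0.3], [0.3, 2.0]])   # illustrative values
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=n)

centered = data - data.mean(axis=0)
S = centered.T @ centered / n    # divisor n (the MLE), not n - 1

score_at_S = score_cov(S, S, n)                   # should be the zero matrix
ll_at_S = gaussian_loglik_cov(S, S, n)
ll_at_truth = gaussian_loglik_cov(true_cov, S, n)
```

The score vanishes at Σ = S, and the likelihood at S is at least as large as at the covariance that generated the data, as expected of a maximiser.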

Chain rule for matrix derivatives

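If h(x) = g(f(x)) where f: Rⁿ → Rᵐ and g: Rᵐ → R, then ∇h(x) = (∂f/∂x)ᵀ ∇g(f(x)), where ∂f/∂x is the m×n Jacobian, so its transpose maps the m-vector ∇g back into Rⁿ. For deep-learning aficionados: this is just backprop. For everyone else: it's how you differentiate things like (Xw - y)ᵀ M (Xw - y), whose gradient in w is Xᵀ(M + Mᵀ)(Xw - y).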
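
The weighted-residual example can be worked through step by step. In this sketch f(w) = Xw - y has Jacobian X, and g(r) = rᵀMr has gradient (M + Mᵀ)r, so the chain rule gives Xᵀ(M + Mᵀ)(Xw - y). All data are random placeholders, and M is not assumed symmetric.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 6, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
M = rng.normal(size=(m, m))   # weighting matrix, NOT assumed symmetric
w = rng.normal(size=n)

def h(w_vec):
    """h(w) = (Xw - y)^T M (Xw - y)."""
    r = X @ w_vec - y
    return r @ M @ r

# Chain rule: inner Jacobian transposed, times outer gradient
residual = X @ w - y
jacobian_f = X                       # d(Xw - y)/dw, shape (m, n)
grad_g = (M + M.T) @ residual        # gradient of r^T M r at r = residual
grad_h = jacobian_f.T @ grad_g       # shape (n,)

# Independent check by central differences along each coordinate direction
eps = 1e-6
grad_fd = np.array([
    (h(w + eps * e) - h(w - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
```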

Exercise

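Derive the maximum-Sharpe (tangency) portfolio with a riskless rate r_f. Problem: maximise (μᵀw - r_f) / √(wᵀΣw) subject to 1ᵀw = 1. Under that constraint μᵀw - r_f = (μ - r_f·1)ᵀw, so the ratio is scale-invariant in w: solve the unconstrained problem max (μ - r_f·1)ᵀw - λ wᵀΣw / 2 for any λ > 0, then rescale the solution so the weights sum to 1. Find w∗.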

LeadAfrikPublic Economics Hub