Unconstrained methods — gradient and Newton — Optimization Module 3

Even constrained optimisation problems reduce, via the Lagrangian, to unconstrained problems in the dual. Understanding the workhorse unconstrained methods — gradient descent, Newton's method, quasi-Newton like BFGS — is the foundation for everything that follows.

Gradient descent

math

x_{k+1} = x_k - α_k ∇f(x_k)

Step in the direction of steepest descent. Simple, robust, scales to millions of variables. The step size α_k is critical: too large and you overshoot, too small and you crawl.

Step-size rules

Constant: α_k = α. Easy but requires tuning. Convergence iff α < 2/L for L-smooth functions.
Backtracking line search: shrink α until Armijo condition f(x - α ∇f) ≤ f(x) - c α ‖∇f‖² is met (c ∈ (0, 1/2)).
Exact line search: minimise f along the descent direction. Cheap for quadratics.
Wolfe conditions: combine Armijo with curvature condition. Standard for quasi-Newton.

Convergence rate of gradient descent

For an L-smooth, μ-strongly convex function (so μ I ⪯ ∇²f ⪯ L I) with constant step α = 1/L:

math

f(x_k) - f(x*) ≤ ((L - μ)/(L + μ))^(2k) (f(x_0) - f(x*))

Linear convergence rate depending on the condition number κ = L/μ. Ill-conditioned problems (large κ) converge slowly — this is the optimisation analogue of the linear-algebra conditioning issue.

Newton's method

math

x_{k+1} = x_k - (∇²f(x_k))⁻¹ ∇f(x_k)

Use the Hessian to choose both direction and step size. Locally quadratic convergence near a minimum (error squares each iteration). Major cost: computing and inverting the Hessian, O(n³) per step.

Quasi-Newton — BFGS

Build an approximation B_k to the Hessian using only gradient evaluations. The BFGS update (Broyden-Fletcher-Goldfarb-Shanno, 1970) is the dominant choice:

math

s_k = x_{k+1} - x_k
y_k = ∇f(x_{k+1}) - ∇f(x_k)
H_{k+1} = (I - s_k y_kᵀ / y_kᵀ s_k) H_k (I - y_k s_kᵀ / y_kᵀ s_k) + s_k s_kᵀ / y_kᵀ s_k

Maintains a PD approximation H_k ≈ (∇²f)⁻¹. Convergence is super-linear (between linear and quadratic). Cost per iteration is O(n²) for vector operations, no matrix factorisations. The default unconstrained solver in scipy.optimize and most numerical-analysis libraries.

Limited-memory BFGS — L-BFGS

For large n (millions of variables), even O(n²) storage is too much. L-BFGS stores only the last m pairs (s_k, y_k), implicitly representing the inverse-Hessian approximation. m = 5-20 is typical. Used everywhere from logistic regression to deep-learning fine-tuning.

Stochastic and accelerated variants

SGD (stochastic gradient descent): use mini-batch gradients. Dominant in ML; useful in finance for very large data.
Momentum / heavy-ball: x_{k+1} = x_k - α ∇f + β (x_k - x_{k-1}). Polyak's method; faster than vanilla GD on ill-conditioned problems.
Nesterov acceleration: clever look-ahead momentum that achieves O(1/k²) instead of O(1/k) for smooth convex problems.
Adam, RMSProp: adaptive step sizes per coordinate. ML defaults.

When does first-order win, when does second-order win?

Newton dominates when n is small and Hessian is cheap. Calibration problems with ~30 parameters: Newton converges in 5-20 iterations.
L-BFGS / quasi-Newton dominate when n is moderate (100-10⁵). Most finance optimisations.
SGD-variants dominate when data is enormous and exact gradients are too expensive. Large-scale ML.

Exercise

Minimise f(x) = (1/2) xᵀAx - bᵀx for A = [[4, 1], [1, 3]], b = (1, 2). (1) Find the analytic minimum. (2) Take one Newton step from x_0 = (0, 0); show it reaches the minimum exactly. (3) Why?