Floating point, conditioning, and stability — Numerical Methods Module 1

Numerical methods are where the maths meets the machine. Every formula in this course will eventually be evaluated in floating-point arithmetic, and every approximation will accumulate error. Understanding floating-point representation, machine epsilon, and conditioning is the prerequisite for trusting any number a computer hands you.

IEEE 754 double precision

A double-precision float uses 64 bits: 1 sign + 11 exponent + 52 mantissa. This gives roughly 15-17 significant decimal digits and a range of about ±10^308. Machine epsilon (the smallest x such that 1 + x ≠ 1 in fp) is ε_mach ≈ 2.22 × 10⁻¹⁶.

Three sources of error

Representation error: not every real number can be exactly represented. 0.1 in binary is a repeating fraction, so 0.1 + 0.2 ≠ 0.3 exactly.
Round-off error: arithmetic operations introduce tiny errors of relative size ≈ ε_mach. Accumulated over many operations, these can grow.
Cancellation error: subtracting two near-equal numbers loses precision. Computing (1 + 10⁻¹⁵) - 1 returns ~10⁻¹⁵ correctly but with much fewer significant digits than the inputs.

Catastrophic cancellation in finance

Classic example: the variance computed as E[X²] - E[X]² when these two terms are nearly equal. The naive formula loses precision because both terms are large but their difference is small. The two-pass / Welford algorithm computes the variance directly without subtracting near-equal quantities — always use it.

Condition number

For a function f, the relative condition number is |f'(x) · x / f(x)|. For matrix problems, κ(A) = ‖A‖ · ‖A⁻¹‖. Large condition number means small input changes produce large output changes — the problem is ill-conditioned.

Condition number ε rule of thumb

Solving Ax = b in floating-point loses roughly log₁₀(κ(A)) digits of accuracy. With ε_mach = 10⁻¹⁶ and κ = 10¹⁰, you have only 6 reliable digits in the answer. Always know your κ before trusting numerical results.

Numerical stability of algorithms

An algorithm is backward-stable if the computed answer is the exact answer for a slightly-perturbed input. Most modern numerical-library algorithms (LU with partial pivoting, QR via Householder, SVD via Golub-Kahan-Reinsch) are backward-stable. Naive algorithms (normal equations for least squares, classical Gram-Schmidt for QR) often are not.

Vectorisation and SIMD

Modern CPUs do 4-16 floating-point operations per cycle when data is properly aligned. NumPy, BLAS, and LAPACK exploit this; Python loops do not. The 100× speed gap between numpy.dot and a Python for-loop is the SIMD gap.

Avoiding loss of precision

Use log-space when probabilities are tiny. log(p₁ · p₂ · ... · p_N) = Σ log p_i.
Reorder sums to add similar-magnitude numbers first (Kahan summation for ultimate precision).
Avoid catastrophic cancellation: rewrite (1 - cos(x)) as 2 sin²(x/2) for small x; rewrite √(x² + 1) - x as 1/(√(x² + 1) + x) for large x.
When the input range is unknown, normalize before computing (max-shift trick for softmax/logsumexp).

Single vs double precision

Doubles (64-bit) are the default in scientific Python. Singles (32-bit, ε ≈ 10⁻⁷) save memory and run faster on GPUs. Production option-pricing code typically uses doubles; large-scale ML often uses singles or even half-precision floats with explicit mixed-precision arithmetic.

Exercise

Why does the naive sample variance s² = (1/n) Σ (x_i - x̄)² perform better numerically than the algebraically-equivalent s² = (1/n) Σ x_i² - x̄²? Construct a numerical example to show the difference.