Module 07 of 12 · 55 min read · Intermediate

Hypothesis testing without the cargo cult

What a p-value really says, type I/II errors, power, multiple testing. The hypotheses a desk quant actually tests.


Hypothesis testing is the formal procedure for deciding whether sample data is consistent with a claim about the population. Done well, it is a rigorous accountability device. Done as ritual — staring at p < 0.05 in isolation — it has produced a generation of papers that didn't replicate and trading strategies that didn't survive contact with markets.

Framework

Specify the null hypothesis H₀ (e.g., μ = 0) and the alternative H₁ (e.g., μ ≠ 0). Compute a test statistic T from the data. Under H₀, T has a known distribution. Define a critical region or compute a p-value; reject H₀ if T falls in the critical region (or p < α).
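The four steps above can be sketched in a few lines. This is a minimal illustration, not a production test: it uses only the Python standard library, approximates the t-distribution with a standard normal (reasonable only for larger samples), and the return series and the name `mean_zero_test` are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_zero_test(returns, alpha=0.05):
    """Two-sided test of H0: mu = 0 against H1: mu != 0.

    Assumption: the t-statistic is treated as standard normal under H0,
    which is only a large-sample approximation.
    """
    n = len(returns)
    t_stat = mean(returns) / (stdev(returns) / math.sqrt(n))
    # Probability under H0 of a statistic at least as extreme as t_stat
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value, p_value < alpha

# Hypothetical daily returns, for illustration only
sample = [0.012, -0.004, 0.007, 0.003, -0.001,
          0.009, 0.005, -0.002, 0.006, 0.004] * 5

t_stat, p_value, reject = mean_zero_test(sample)
print(f"T = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject}")
```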

Type I and Type II errors

  • Type I error: rejecting a true H₀. Probability α (the significance level, typically 5%).
  • Type II error: failing to reject a false H₀. Probability β.
  • Power = 1 - β: probability of rejecting H₀ when it is false.
  • Power depends on sample size, effect size, and α. Underpowered tests can't detect real effects.
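To make the dependence on sample size and effect size concrete, here is a rough power calculation for a two-sided z-test of μ = 0. It assumes known variance and normally distributed data, and `effect_size` is a hypothetical name for the true mean divided by the standard deviation (a per-period Sharpe ratio).

```python
import math
from statistics import NormalDist

def power_two_sided(effect_size, n, alpha=0.05):
    """Power of a two-sided z-test of H0: mu = 0.

    effect_size is the true mean in units of sigma (an assumption of
    this sketch); n is the number of observations.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * math.sqrt(n)  # mean of the statistic under H1
    # Probability of landing in the upper or the lower rejection tail
    return (1.0 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

for n in (250, 1000, 5000):
    print(n, round(power_two_sided(0.05, n), 3))
```

With a zero effect the function returns α itself, which is the type I error rate; power climbs towards 1 as n or the effect size grows.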

What a p-value is and isn't

Definition

The p-value is the probability, under H₀, of observing a test statistic at least as extreme as the one observed. It is not the probability that H₀ is true. It is not the probability that the alternative is false. It is a conditional probability of the data (or more extreme data) given the null, not the probability of the null given the data.
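A small simulation can make the definition tangible. In the sketch below H₀ is true in every trial by construction, yet roughly 5% of trials still produce p < 0.05, so a small p-value on its own says nothing about how likely H₀ is. The seed, trial counts, and normal approximation are all assumptions of the illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def p_value_mean_zero(xs):
    """Two-sided p-value for H0: mu = 0 (normal approximation)."""
    t = mean(xs) / (stdev(xs) / math.sqrt(len(xs)))
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

rng = random.Random(7)  # fixed seed so the run is repeatable
n_trials, n_obs = 2000, 200

# Every sample is pure noise with true mean zero, so H0 holds each time
p_values = [
    p_value_mean_zero([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
    for _ in range(n_trials)
]
false_positive_rate = sum(p < 0.05 for p in p_values) / n_trials
print(f"share of p < 0.05 under a true null: {false_positive_rate:.3f}")
```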

Tests a quant runs every week

  • t-test on strategy mean: is the mean return, and hence the realised Sharpe ratio, different from zero?
  • F-test on regression: are all slope coefficients jointly zero?
  • Wald test on hypotheses about β: do the coefficients satisfy a linear restriction?
  • Likelihood-ratio test: comparing nested models (e.g., AR(1) vs AR(2)).
  • Kolmogorov-Smirnov / Anderson-Darling: is the residual distribution normal?
  • Ljung-Box: are residuals autocorrelated?
  • Augmented Dickey-Fuller: does a time series have a unit root?
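As one worked instance from the list, the Ljung-Box statistic can be computed by hand. This is a sketch for intuition only: the function name is hypothetical, the input series is artificial, and the result is compared against the 5% chi-square critical value for 5 degrees of freedom (11.070). For residuals of a fitted ARMA model the degrees of freedom should be reduced by the number of estimated parameters, and in practice a tested library implementation is preferable.

```python
from statistics import mean

def ljung_box_q(xs, max_lag):
    """Ljung-Box Q = n(n+2) * sum_k rho_k^2 / (n - k), k = 1..max_lag."""
    n = len(xs)
    m = mean(xs)
    dev = [x - m for x in xs]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k
        rho_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total

CHI2_95_DF5 = 11.070  # 5% critical value, chi-square with 5 degrees of freedom

# Artificial series with obvious negative autocorrelation at lag 1
alternating = [1.0 if t % 2 == 0 else -1.0 for t in range(100)]
q_stat = ljung_box_q(alternating, max_lag=5)
print(f"Q = {q_stat:.1f}, reject 'no autocorrelation': {q_stat > CHI2_95_DF5}")
```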

Confidence intervals

A 95% CI is the set of parameter values not rejected by a two-sided test at α = 5%. Interpretation: under repeated sampling, 95% of constructed CIs contain the true parameter. Not: 'there's a 95% probability the true parameter is in this CI' — that's a Bayesian interpretation that requires a prior.
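The duality between the interval and the test can be checked directly. The sketch below builds a normal-approximation 95% interval for a mean and confirms that values inside it are not rejected at α = 5% while values outside it are. The data and function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(xs, level=0.95):
    """Normal-approximation confidence interval for the mean."""
    se = stdev(xs) / math.sqrt(len(xs))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    m = mean(xs)
    return m - z_crit * se, m + z_crit * se

def rejects_at(xs, mu0, alpha=0.05):
    """Two-sided test of H0: mu = mu0 using the same approximation."""
    se = stdev(xs) / math.sqrt(len(xs))
    t = (mean(xs) - mu0) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(t))) < alpha

# Hypothetical daily returns, for illustration only
sample = [0.012, -0.004, 0.007, 0.003, -0.001,
          0.009, 0.005, -0.002, 0.006, 0.004] * 5

lo, hi = mean_ci(sample)
print(f"95% CI: [{lo:.5f}, {hi:.5f}]")
print("midpoint rejected:", rejects_at(sample, (lo + hi) / 2))
print("value above the CI rejected:", rejects_at(sample, hi + 0.001))
```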

Multiple testing — the killer

Test 20 independent hypotheses at α = 5% each, all true H₀. Expected false positives: 1. Probability of at least one false positive: 1 - 0.95²⁰ ≈ 64%. This is the multiple-testing problem, and it is the single biggest reason backtested strategies don't survive live trading.
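The arithmetic generalises to any number of independent tests. A short sketch, assuming independence and that every null is true:

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of rejections when every null is true."""
    return n_tests * alpha

def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

for m in (1, 20, 200, 1000):
    print(m, expected_false_positives(m), round(family_wise_error_rate(m), 4))
```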

  • Bonferroni correction: divide α by the number of tests. Conservative but simple.
  • Holm-Bonferroni: sequential, less conservative than Bonferroni.
  • Benjamini-Hochberg FDR: control the expected fraction of false discoveries, not the family-wise error rate. The empirical-genomics standard.
  • Harvey-Liu-Zhu (2016): after multiple-testing adjustment, a t-statistic of about 3.0 in published finance research is roughly the new 2.0.
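The first three corrections are short enough to write out. The sketch below is a plain-Python illustration with hypothetical p-values; each function returns a list of reject decisions in the original order.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Step-down: compare the k-th smallest p to alpha / (m - k + 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank is 0-based
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first hypothesis that fails
    return reject

def benjamini_hochberg(p_values, q=0.05):
    """Step-up: reject the k smallest p-values, k = max rank with p_(k) <= k*q/m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

# Hypothetical p-values from six tests
pvals = [0.001, 0.009, 0.02, 0.03, 0.04, 0.2]
n_bonf = sum(bonferroni(pvals))
n_holm = sum(holm(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(n_bonf, n_holm, n_bh)
```

On these inputs Bonferroni rejects the fewest hypotheses and Benjamini-Hochberg the most, which matches the ordering from most to least conservative described above.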

p-hacking in finance

Trying 200 backtest variations, reporting the best one with p < 0.05, and submitting it to a Sharpe-ratio ranking: this is precisely the p-hacking pipeline. The 'effective number of independent tests' in a backtest framework can be in the thousands; pre-registration and out-of-sample validation are the only real fixes.
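The selection effect is easy to reproduce. In the sketch below all 200 "strategies" are pure noise with zero true mean, yet the best one looks significant when judged in isolation. The seed, volatility, and sample length are arbitrary assumptions, and the variants are independent, which real backtest variations usually are not.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def t_stat(xs):
    """t-statistic for H0: mean = 0."""
    return mean(xs) / (stdev(xs) / math.sqrt(len(xs)))

rng = random.Random(42)  # fixed seed so the run is repeatable
n_variants, n_days = 200, 250

# Every variant has zero true edge: daily returns are N(0, 1%)
t_stats = [
    t_stat([rng.gauss(0.0, 0.01) for _ in range(n_days)])
    for _ in range(n_variants)
]

best_t = max(t_stats)
best_p = 1.0 - NormalDist().cdf(best_t)  # one-sided, normal approximation
n_significant = sum(1.0 - NormalDist().cdf(t) < 0.05 for t in t_stats)

print(f"best t-stat: {best_t:.2f}, naive one-sided p: {best_p:.4f}")
print(f"variants with p < 0.05: {n_significant} of {n_variants}")
```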

Effect size matters at least as much as significance

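With T = 10⁶ observations, the standard error shrinks by a factor of 1,000, so an edge of only about 0.003 standard deviations per observation (roughly 0.003% per period at 1% volatility) is statistically significant at p < 0.01, yet it vanishes after transaction costs. Always report effect sizes alongside p-values. A confidence interval communicates both at once: its centre shows the estimated effect and its width shows the precision.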
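How small an edge becomes detectable depends on volatility as well as T. The sketch below computes the smallest mean return that reaches two-sided significance and compares it with a cost figure. The 1% per-period volatility and the 5 basis point cost are hypothetical inputs, not values from the text.

```python
import math
from statistics import NormalDist

def min_significant_edge(n_obs, sigma, alpha=0.01):
    """Smallest mean return that is significant in a two-sided z-test."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z_crit * sigma / math.sqrt(n_obs)

sigma = 0.01   # hypothetical per-period volatility (1%)
cost = 0.0005  # hypothetical cost per period (5 basis points)

edge = min_significant_edge(10**6, sigma)
print(f"smallest significant edge: {edge:.6%} per period")
print(f"net of costs: {edge - cost:.6%} per period")
```

Under these assumptions the statistically detectable edge is more than an order of magnitude smaller than the cost, so significance alone says little about profitability.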

Exercise

A backtester evaluates 1000 candidate strategies. Under the null, each strategy's true Sharpe ratio is zero; the test is one-sided at α = 5%. Returns are i.i.d. and the strategies are independent of one another. (1) Under H₀ (no strategy has real skill), what is the expected number of 'significant' strategies? (2) If 70 strategies show p < 0.05, is this evidence of skill? (3) What is the Bonferroni-corrected critical p-value?
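One way to check your answers numerically, after attempting the exercise by hand, is sketched below. It assumes the 1000 tests are independent of one another, so the count of false positives under H₀ is binomial; the helper `binom_tail` is a hypothetical name.

```python
import math

n_strategies, alpha = 1000, 0.05

# (1) Expected number of false positives when every null is true
expected_false_positives = n_strategies * alpha

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(
        math.comb(n, j) * p**j * (1.0 - p) ** (n - j)
        for j in range(k, n + 1)
    )

# (2) How surprising are 70 or more rejections if no strategy has skill?
tail_70 = binom_tail(70, n_strategies, alpha)

# (3) Bonferroni-corrected per-test threshold
bonferroni_threshold = alpha / n_strategies

print(expected_false_positives, tail_70, bonferroni_threshold)
```

A small tail probability in part (2) suggests that some skill exists somewhere in the set, but it does not identify which of the 70 strategies are genuine.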

LeadAfrikPublic Economics Hub