Hypothesis testing is the formal procedure for deciding whether sample data is consistent with a claim about the population. Done well, it is a rigorous accountability device. Done as ritual — staring at p < 0.05 in isolation — it has produced a generation of papers that didn't replicate and trading strategies that didn't survive contact with markets.
Framework
Specify the null hypothesis H₀ (e.g., μ = 0) and the alternative H₁ (e.g., μ ≠ 0). Compute a test statistic T from the data. Under H₀, T has a known distribution. Define a critical region or compute a p-value; reject H₀ if T falls in the critical region (or p < α).
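The steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a large sample so that the test statistic is approximately standard normal under H₀ (a Student-t distribution would be used for small samples), and the function name `mean_test` and the simulated returns are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def mean_test(returns, mu0=0.0, alpha=0.05):
    """Two-sided test of H0: mu = mu0, using a large-sample normal approximation."""
    n = len(returns)
    se = stdev(returns) / math.sqrt(n)      # standard error of the sample mean
    t_stat = (mean(returns) - mu0) / se     # test statistic T
    # Two-sided p-value: P(|Z| >= |T|) with Z ~ N(0, 1) under H0.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value, p_value < alpha


rng = random.Random(0)
# Hypothetical daily returns: true mean 0.05%, volatility 1%.
sample = [rng.gauss(0.0005, 0.01) for _ in range(2500)]
t_stat, p_value, reject = mean_test(sample)
print(f"T = {t_stat:.2f}, p = {p_value:.3f}, reject H0: {reject}")
```

Rejecting when p < α is equivalent to rejecting when |T| exceeds the critical value for α, so the two formulations in the text give the same decision.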
Type I and Type II errors
- Type I error: rejecting a true H₀. Probability α (the significance level, typically 5%).
- Type II error: failing to reject a false H₀. Probability β.
- Power = 1 - β: probability of rejecting H₀ when it is false.
- Power depends on sample size, effect size, the noise level (variance), and α. Underpowered tests will usually miss real effects.
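To make the power trade-off concrete, here is a rough sketch of the power of a two-sided z-test of H₀: μ = 0. It assumes i.i.d. returns with known volatility; the function name `power_two_sided` and the example strategy (annualised Sharpe of 0.5, 252 trading days per year) are illustrative assumptions.

```python
import math
from statistics import NormalDist


def power_two_sided(effect, sigma, n, alpha=0.05):
    """Approximate power of a two-sided z-test of H0: mu = 0 when the true mean is `effect`."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect * math.sqrt(n) / sigma   # mean of the test statistic under H1
    # Under H1 the statistic is N(shift, 1); reject if it lands beyond +/- z_crit.
    return (1.0 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical strategy: annualised Sharpe 0.5, so daily mean / daily vol = 0.5 / sqrt(252).
daily_ratio = 0.5 / math.sqrt(252)
for years in (1, 5, 10, 30):
    power = power_two_sided(daily_ratio, 1.0, 252 * years)
    print(f"{years:>2} years of daily data: power = {power:.2f}")
```

With a zero effect the function returns α, as it should: the rejection rate under H₀ is the Type I error rate.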
What a p-value is and isn't
Definition
The p-value is the probability, under H₀, of observing a test statistic at least as extreme as the one observed. It is not the probability that H₀ is true. It is not the probability that the alternative is false. It is a conditional probability of the data (or more extreme data) given the null, not of the null given the data.
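One way to see what the definition implies: when H₀ is true, p-values are (approximately) uniform on [0, 1], so about 5% of them fall below 0.05 no matter how much data is collected. The simulation below is a sketch under assumed i.i.d. normal data and a normal approximation for the test statistic; the trial counts are arbitrary.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def p_value_two_sided(sample):
    """Two-sided p-value for H0: mu = 0 (normal approximation)."""
    t_stat = mean(sample) / (stdev(sample) / math.sqrt(len(sample)))
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


rng = random.Random(1)
n_trials, n_obs = 1000, 100
# Every sample is drawn with true mean zero, so H0 holds in every trial.
p_values = [
    p_value_two_sided([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
    for _ in range(n_trials)
]
frac_below = sum(p < 0.05 for p in p_values) / n_trials
print(f"Fraction of null p-values below 0.05: {frac_below:.3f}")
```

A small p-value therefore says the data would be unusual if H₀ held; it says nothing directly about how likely H₀ is.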
Tests a quant runs every week
- t-test on strategy mean: is the mean return (and therefore the realised Sharpe ratio) different from zero?
- F-test on regression: are all slope coefficients jointly zero?
- Wald test on hypotheses about the regression coefficients β (not the Type II error rate β above): do the coefficients satisfy a linear restriction?
- Likelihood-ratio test: comparing nested models (e.g., AR(1) vs AR(2)).
- Kolmogorov-Smirnov / Anderson-Darling: is the residual distribution normal?
- Ljung-Box: are residuals autocorrelated?
- Augmented Dickey-Fuller: does a time series have a unit root?
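As one worked instance from the list, the Ljung-Box statistic is Q = n(n+2) Σₖ ρ̂ₖ² / (n - k), summed over lags k = 1..h, and is compared against a chi-square distribution with h degrees of freedom. The sketch below is hand-rolled for illustration (in practice a statistics library would be used); the simulated series and the AR coefficient 0.5 are assumptions.

```python
import random


def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic for autocorrelation up to `max_lag`."""
    n = len(x)
    mu = sum(x) / n
    dev = [v - mu for v in x]
    denom = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k.
        rho_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q


# 95% quantile of the chi-square distribution with 10 degrees of freedom.
# For residuals of a fitted ARMA(p, q) model the degrees of freedom drop by p + q.
CHI2_95_DF10 = 18.307

rng = random.Random(2)
white = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ar1 = [0.0]
for _ in range(999):
    ar1.append(0.5 * ar1[-1] + rng.gauss(0.0, 1.0))   # AR(1) with coefficient 0.5

for name, series in (("white noise", white), ("AR(1)", ar1)):
    q_stat = ljung_box_q(series, 10)
    print(f"{name}: Q = {q_stat:.1f}, reject at 5%: {q_stat > CHI2_95_DF10}")
```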
Confidence intervals
A 95% CI is the set of parameter values not rejected by a two-sided test at α = 5%. Interpretation: under repeated sampling, 95% of constructed CIs contain the true parameter. Not: 'there's a 95% probability the true parameter is in this CI' — that's a Bayesian interpretation that requires a prior.
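The repeated-sampling interpretation can be checked by simulation: build many intervals from fresh samples and count how often they cover the true parameter. This sketch assumes normal data with known volatility (so a z-interval is exact); the function name `coverage` and all parameter values are illustrative.

```python
import math
import random
from statistics import NormalDist


def coverage(true_mu=0.1, sigma=1.0, n_obs=50, n_trials=2000, level=0.95, seed=3):
    """Fraction of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z_crit * sigma / math.sqrt(n_obs)   # sigma is treated as known
    hits = 0
    for _ in range(n_trials):
        xbar = sum(rng.gauss(true_mu, sigma) for _ in range(n_obs)) / n_obs
        if xbar - half_width <= true_mu <= xbar + half_width:
            hits += 1
    return hits / n_trials


print(f"Empirical coverage of nominal 95% intervals: {coverage():.3f}")
```

The randomness is in the interval, not in the parameter: any single realised interval either contains the true mean or it does not.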
Multiple testing — the killer
Test 20 independent hypotheses at α = 5% each, with every H₀ true. Expected false positives: 20 × 0.05 = 1. Probability of at least one false positive: 1 - 0.95²⁰ ≈ 64%. This is the multiple-testing problem, and it is the single biggest reason backtested strategies don't survive live trading.
- Bonferroni correction: divide α by the number of tests. Conservative but simple.
- Holm-Bonferroni: sequential, less conservative than Bonferroni.
- Benjamini-Hochberg FDR: control the expected fraction of false discoveries, not the family-wise error rate. The empirical-genomics standard.
- Harvey-Liu-Zhu (2016): after adjusting for the number of factors the finance literature has already tested, a newly proposed factor should clear a t-statistic of roughly 3.0 rather than the conventional 2.0.
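The first three corrections are short enough to write out. The following is a minimal sketch of the textbook procedures, not a substitute for a tested library routine; the function names and the example p-values are made up for illustration.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests when every H0 is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values with alpha/m, alpha/(m-1), ...; stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def benjamini_hochberg(p_values, q=0.05):
    """BH step-up: find the largest rank k with p_(k) <= q*k/m, reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


example_p = [0.001, 0.008, 0.012, 0.030, 0.20, 0.74]   # hypothetical p-values
print(f"FWER for 20 tests: {family_wise_error_rate(20):.3f}")
print("Bonferroni rejections:", sum(bonferroni(example_p)))
print("Holm rejections:", sum(holm(example_p)))
print("BH rejections:", sum(benjamini_hochberg(example_p)))
```

On these six p-values Bonferroni rejects 2, Holm 3, and Benjamini-Hochberg 4, which matches the ordering from most to least conservative described above.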
p-hacking in finance
Trying 200 backtest variations, reporting the best one with p < 0.05, and submitting it to a Sharpe-ratio ranking: this is precisely the p-hacking pipeline. The 'effective number of independent tests' in a backtest framework can be in the thousands; pre-registration and out-of-sample validation are the only real fixes.
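A small simulation shows how easily the pipeline manufactures a "significant" result from pure noise. It is a sketch under strong assumptions: 200 independent variations, i.i.d. normal daily returns with zero true mean, one year of data each. The adjusted p-value uses the independence formula 1 - (1 - p)ⁿ; real backtest variations are correlated, so this is only indicative.

```python
import math
import random
from statistics import NormalDist


def best_of_null_backtests(n_variants=200, n_days=252, seed=4):
    """Pick the best t-statistic among strategies that all have zero true edge."""
    rng = random.Random(seed)
    best_t = -math.inf
    for _ in range(n_variants):
        r = [rng.gauss(0.0, 0.01) for _ in range(n_days)]   # pure-noise daily returns
        m = sum(r) / n_days
        s = math.sqrt(sum((x - m) ** 2 for x in r) / (n_days - 1))
        best_t = max(best_t, m / (s / math.sqrt(n_days)))
    # One-sided p-value of the winner, ignoring the fact that it was selected.
    naive_p = 1.0 - NormalDist().cdf(best_t)
    # Chance that the best of n_variants independent null tests looks at least this good.
    adjusted_p = 1.0 - (1.0 - naive_p) ** n_variants
    return best_t, naive_p, adjusted_p


best_t, naive_p, adjusted_p = best_of_null_backtests()
print(f"best t-stat = {best_t:.2f}, naive p = {naive_p:.4f}, adjusted p = {adjusted_p:.2f}")
```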
Effect size matters at least as much as significance
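With n = 10⁶ observations and 1% per-period volatility, the standard error of the mean is 0.001%, so a 0.003% per-period edge has a t-statistic of about 3 and is statistically significant at p < 0.01, but it vanishes after transaction costs. Always report effect sizes alongside p-values. A confidence interval communicates both at once: the size of the effect and the uncertainty around it.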
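The gap between statistical and economic significance is easy to compute. All inputs in this sketch are hypothetical: a per-period edge of 0.003%, per-period volatility of 1%, 10⁶ observations, and a cost of 1 basis point per period. The function name `edge_summary` is likewise an illustration.

```python
import math
from statistics import NormalDist


def edge_summary(edge, vol, n_obs, cost, level=0.95):
    """t-statistic, one-sided p-value, confidence interval, and net-of-cost edge."""
    se = vol / math.sqrt(n_obs)                  # standard error of the mean return
    t_stat = edge / se
    p_one_sided = 1.0 - NormalDist().cdf(t_stat)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    ci = (edge - z * se, edge + z * se)
    return t_stat, p_one_sided, ci, edge - cost


t_stat, p, ci, net = edge_summary(edge=0.00003, vol=0.01, n_obs=10**6, cost=0.0001)
print(f"t = {t_stat:.2f}, p = {p:.4f}")
print(f"95% CI for the edge: [{ci[0]:.6f}, {ci[1]:.6f}]")
print(f"edge net of costs: {net:.6f}")
```

Under these assumptions the whole confidence interval sits below the cost per period, so the edge is statistically real and economically worthless.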
Exercise
A backtester evaluates 1000 candidate strategies. Each one's null Sharpe is zero; the test is one-sided at α = 5%. Returns are i.i.d. (1) Under H₀ (no real strategies), expected number of 'significant' strategies? (2) If 70 strategies show p < 0.05, is this evidence of skill? (3) What's the Bonferroni-corrected critical p-value?
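One way to check your answers numerically is sketched below. It assumes the 1000 tests are independent of one another, which the exercise does not state, and it uses a normal approximation to the binomial count of false positives.

```python
import math
from statistics import NormalDist

n_strategies, alpha, observed = 1000, 0.05, 70

# (1) Expected number of false positives when every H0 is true.
expected = n_strategies * alpha

# (2) How unusual is the observed count under the global null?
sd = math.sqrt(n_strategies * alpha * (1.0 - alpha))   # binomial standard deviation
z = (observed - expected) / sd
tail = 1.0 - NormalDist().cdf(z)                       # approximate P(count >= observed)

# (3) Bonferroni-corrected per-test threshold.
bonferroni_p = alpha / n_strategies

print(f"expected = {expected:.0f}, z = {z:.2f}, tail = {tail:.4f}, threshold = {bonferroni_p:.0e}")
```

A small tail probability suggests that not every strategy is null, but it does not say which of the 70 are real; correlated strategies would also widen the null distribution of the count and weaken the conclusion.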