Overfitting in Quantitative Trading

Overfitting is the failure mode in which a trading model is tuned — knowingly or not — to the idiosyncratic noise of a particular historical sample, so it produces an impressive in-sample backtest while having little or no genuine, repeatable edge. Bailey, Borwein, López de Prado and Zhu, in their widely cited Notices of the American Mathematical Society paper “Pseudo-Mathematics and Financial Charlatanism”, prove that this is not an occasional accident but a near-certainty of modern quantitative practice: high simulated performance is easily achievable after backtesting a relatively small number of alternative strategy configurations, and the more configurations tried, the greater the probability the chosen backtest is overfit. They name the practice backtest overfitting and show that, because signal-to-noise ratios in finance are weak, an optimiser scanning millions or billions of variants will reliably surface a configuration that targets the extremes and vagaries of one dataset rather than any general financial structure.
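The selection effect described above can be reproduced with pure noise. The sketch below is illustrative only and is not taken from the paper: it draws zero-mean Gaussian daily returns for a number of hypothetical strategy configurations, none of which has any edge, and reports the best annualised in-sample Sharpe ratio. The function names, the 1% daily volatility, the 252-day year and the seed are all assumptions.

```python
import math
import random
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualisation factor


def annualised_sharpe(daily_returns):
    """Sample Sharpe ratio of daily returns, scaled to annual units."""
    return mean(daily_returns) / stdev(daily_returns) * math.sqrt(TRADING_DAYS)


def best_noise_sharpe(n_configs, n_days=TRADING_DAYS, seed=0):
    """Best in-sample Sharpe among n_configs strategies that are pure noise.

    Every simulated strategy has a true Sharpe ratio of zero, so whatever
    this returns is produced by selection alone.
    """
    rng = random.Random(seed)
    best = -math.inf
    for _ in range(n_configs):
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        best = max(best, annualised_sharpe(returns))
    return best


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} configurations -> best Sharpe {best_noise_sharpe(n):.2f}")
```

Because the seed is fixed, each larger search contains the smaller one, so the reported maximum can only rise as more configurations are tried. On a one-year sample the best of a few hundred noise strategies typically shows an annualised Sharpe ratio well above 2.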

The most consequential result for this vault’s research question is the behaviour out-of-sample. Pure noise-fitting would predict zero expected out-of-sample return, but Bailey et al. show that under memory effects (serial dependence in returns, which real markets exhibit) backtest overfitting produces negative expected out-of-sample returns. The overfit strategy is so closely tied to in-sample noise that it is structurally mis-positioned for the live period. The authors propose this as one explanation for why so many quantitative funds appear to fail after launch. They also derive a Minimum Backtest Length (MinBTL): the backtest length, expressed as a function of the number of trials, below which the expected maximum in-sample Sharpe ratio exceeds a target threshold even when every candidate’s true Sharpe ratio is zero. MinBTL grows with the number of trials, so a short backtest combined with a large search yields, by construction, a fabricated edge.
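A minimal sketch of the MinBTL calculation follows, assuming the approximation given in the paper: for N independent trials with true Sharpe ratio zero, the expected maximum annualised in-sample Sharpe ratio over one year is about (1 − γ)·Z⁻¹[1 − 1/N] + γ·Z⁻¹[1 − 1/(N·e)], where γ is the Euler-Mascheroni constant and Z⁻¹ is the inverse standard normal CDF. MinBTL in years is that quantity divided by the target Sharpe ratio, squared, and is bounded above by 2·ln(N) divided by the target squared. The function names are hypothetical, and the independence of trials is an assumption that real parameter sweeps rarely satisfy.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials):
    """Approximate expected maximum annualised Sharpe over one year of data
    among n_trials independent strategies whose true Sharpe ratio is zero."""
    if n_trials < 2:
        raise ValueError("the approximation needs at least 2 trials")
    z_inv = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * z_inv(1 - 1 / n_trials)
            + EULER_GAMMA * z_inv(1 - 1 / (n_trials * math.e)))


def min_backtest_length(n_trials, target_sharpe):
    """Backtest length in years below which the expected best in-sample
    Sharpe of n_trials skill-less strategies exceeds target_sharpe."""
    return (expected_max_sharpe(n_trials) / target_sharpe) ** 2


def min_backtest_length_bound(n_trials, target_sharpe):
    """Simpler upper bound on MinBTL: 2 ln(N) / target^2."""
    return 2 * math.log(n_trials) / target_sharpe ** 2


if __name__ == "__main__":
    for n in (10, 45, 1000, 1_000_000):
        print(n, round(min_backtest_length(n, 1.0), 2),
              round(min_backtest_length_bound(n, 1.0), 2))
```

With a target annualised Sharpe ratio of 1, about 45 independent trials already require roughly five years of data. That is the paper's own illustration: with five years available, no more than about 45 independent configurations should be tried.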

In Markov-model research overfitting appears in many guises, each a hyperparameter that can be (and routinely is) selected to maximise backtest performance: the number of HMM states, the discretisation of Markov-chain price buckets, the regime count of a Markov Regime-Switching Model, the MDP state-space definition, and the RL reward function. Each is a search dimension. The model-specific instances are recorded separately as State-Count Selection, Reward Design Sensitivity and HMM Parameter Instability; the cross-strategy, multiple-testing form is Data-Snooping Bias. Crucially, Bailey et al. note that almost no published backtest declares how many configurations were tried — so readers cannot judge the degree of overfitting, and the vault treats an undisclosed search as a strong downgrade signal.

The standard defence, Out-of-Sample Backtesting, is necessary but insufficient. The companion paper “The Probability of Backtest Overfitting” shows that a single hold-out or train/test split is unreliable for investment backtests for three reasons: if the data is public the researcher has likely seen the hold-out; hold-out estimates have high variance on the short samples typical in finance; and, decisively, hold-out ignores the number of trials attempted before a configuration was selected. Bailey, Borwein, López de Prado and Zhu instead propose the Probability of Backtest Overfitting (PBO), estimated by combinatorially symmetric cross-validation (CSCV). CSCV splits the matrix of backtested returns into an even number of blocks, treats every combination of half the blocks as the in-sample set with the remaining half as out-of-sample, and records how often the in-sample winner ranks below the median out-of-sample. It needs only that matrix of returns as input. Bailey and López de Prado’s Deflated Sharpe Ratio complements this by discounting a reported Sharpe ratio for selection bias and non-normality. The practitioner consensus (see also Campbell Harvey and Yan Liu’s Sharpe-ratio “haircut” work) is that any positive Markov backtest reported without a trial count, a costed out-of-sample test, and an overfitting diagnostic is weak evidence by default.
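The CSCV procedure can be sketched in a few lines. This is a simplified illustration rather than the authors' reference implementation. It assumes the input is a list of T rows, each holding the same-period returns of N configurations; it uses the un-annualised Sharpe ratio as the performance statistic, drops trailing rows that do not fill a block, and counts a logit of exactly zero as overfit. The name pbo_cscv and the default of 8 blocks are hypothetical choices.

```python
import math
import random
from itertools import combinations
from statistics import mean, stdev


def sharpe(returns):
    """Un-annualised sample Sharpe ratio; 0.0 if returns have no variance."""
    s = stdev(returns)
    return mean(returns) / s if s > 0 else 0.0


def pbo_cscv(returns_matrix, n_blocks=8):
    """Estimate the Probability of Backtest Overfitting by CSCV.

    returns_matrix: list of T rows, each a list of N strategy returns.
    """
    if n_blocks < 2 or n_blocks % 2:
        raise ValueError("n_blocks must be an even number >= 2")
    n_strats = len(returns_matrix[0])
    block_len = len(returns_matrix) // n_blocks  # trailing rows are dropped
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]

    logits = []
    for train_ids in combinations(range(n_blocks), n_blocks // 2):
        train_rows = [r for b in train_ids for r in blocks[b]]
        test_rows = [r for b in range(n_blocks) if b not in train_ids
                     for r in blocks[b]]
        is_perf = [sharpe([returns_matrix[r][j] for r in train_rows])
                   for j in range(n_strats)]
        oos_perf = [sharpe([returns_matrix[r][j] for r in test_rows])
                    for j in range(n_strats)]
        best = max(range(n_strats), key=is_perf.__getitem__)  # in-sample winner
        # Out-of-sample rank of the winner: 1 = worst, N = best.
        rank = sum(1 for p in oos_perf if p <= oos_perf[best])
        omega = rank / (n_strats + 1)  # relative rank in (0, 1)
        logits.append(math.log(omega / (1 - omega)))

    # PBO: share of splits where the winner lands at or below the OOS median.
    return sum(1 for lam in logits if lam <= 0) / len(logits)


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [[rng.gauss(0.0, 0.01) for _ in range(10)] for _ in range(160)]
    print("PBO on pure noise:", pbo_cscv(noise))
```

A set of configurations with no real edge tends to give a PBO near one half. A configuration that wins both in-sample and out-of-sample on every split drives the estimate towards zero.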

This note is central to the vault because separating real edge from overfitting is the research question. No Markov-model study reviewed so far reports a PBO, a Deflated Sharpe Ratio, or a disclosed configuration count — which is itself a finding.

Overfitting in Quantitative Trading [causes] Markov Chain Trading Model
Overfitting in Quantitative Trading [relates] Data-Snooping Bias
Out-of-Sample Backtesting [opposes] Overfitting in Quantitative Trading
Deflated Sharpe Ratio [opposes] Overfitting in Quantitative Trading
Probability of Backtest Overfitting [defines] Overfitting in Quantitative Trading
Pseudo-Mathematics and Financial Charlatanism [defines] Overfitting in Quantitative Trading

Connections

Sources