Backtest-to-Live Performance Gap

The backtest-to-live performance gap is the systematic, repeatedly measured deterioration of a trading strategy’s returns between its published or simulated backtest and its performance once deployed with real capital. It is not an occasional accident but a structural regularity of quantitative finance: the gap is large, it is documented across asset-pricing anomalies, machine-learning funds and AI-driven hedge funds, and its causes are understood. For this vault the gap is decisive, because every Markov-model profitability claim gathered so far is a backtest or simulation. If a credible live Markov-model track record existed, it would dominate any backtest as evidence; its complete absence, set against a well-documented gap, is itself strong negative evidence on the standalone-profitability question.

The single most direct measurement comes from McLean and Pontiff’s “Does Academic Research Destroy Stock Return Predictability?” (Journal of Finance, 2016), summarised in the McLean and Pontiff 2016 note. Replicating 97 cross-sectional return predictors published in peer-reviewed journals, they find hedge-portfolio returns are 26% lower out-of-sample (after the original sample ends but before publication) and 58% lower post-publication than in-sample. The 32-percentage-point difference between those two figures is attributable to investors learning the anomaly and trading it away; the 26% decay that appears before publication is an upper bound on statistical bias, exposed once the lucky sample window ends. Decay is largest three to four years after publication and worst for predictors with higher in-sample returns and purely technical (price/volume) signals, which is precisely the profile of many Markov and regime-switching trading rules. Their conclusion is blunt: an investor should expect to capture well under half the gross return a published study reports, and less again after trading frictions. Subsequent meta-studies (Chen and Zimmermann 2022 and Chen & Zimmermann’s “Publication Bias in Asset Pricing Research”) confirm the post-publication decay while debating how much is data mining versus genuine mispricing being arbitraged; either way the forward-looking lesson is the same: published numbers overstate live returns.
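The arithmetic behind those decay figures can be sketched in a few lines. This is an illustration only: the 26% and 58% figures are the averages quoted above, the function name `expected_returns` and the 10% in-sample return are hypothetical, and applying a cross-predictor average to a single strategy is an assumption rather than anything the paper claims.

```python
# Average decay figures quoted above from McLean and Pontiff (2016).
OUT_OF_SAMPLE_DECAY = 0.26     # after the sample ends, before publication
POST_PUBLICATION_DECAY = 0.58  # after publication


def expected_returns(in_sample_return: float) -> dict:
    """Apply the average decay figures to a reported in-sample return.

    Assumption: the strategy decays like the average published predictor.
    Trading costs are NOT deducted here; they would lower the result further.
    """
    return {
        "in_sample": in_sample_return,
        "out_of_sample": in_sample_return * (1.0 - OUT_OF_SAMPLE_DECAY),
        "post_publication": in_sample_return * (1.0 - POST_PUBLICATION_DECAY),
        # Share of the in-sample return lost to publication-informed trading.
        "publication_effect_share": POST_PUBLICATION_DECAY - OUT_OF_SAMPLE_DECAY,
    }


if __name__ == "__main__":
    # Hypothetical backtest reporting 10% per year before costs.
    for label, value in expected_returns(0.10).items():
        print(f"{label:>26}: {value:.3f}")
```

For a hypothetical 10% in-sample return this gives roughly 7.4% out-of-sample and 4.2% post-publication, before any trading frictions are subtracted.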

The mechanism is not unique to factor anomalies. Marcos López de Prado’s “The 10 Reasons Most Machine Learning Funds Fail” (Journal of Portfolio Management, 2018), covered in the vault note of the same name, describes the gap from inside the industry: “research through backtesting” (iterating an ML model until a nice backtest appears), reliance on a single walk-forward path, and undisclosed trial counts all manufacture an in-sample result that does not survive contact with a live market. His earlier co-authored paper “Pseudo-Mathematics and Financial Charlatanism” gives the sharpest version: under the memory effects real markets exhibit, an overfit backtest produces negative, not merely zero, expected out-of-sample returns, so the gap can flip a strategy’s sign. This is the same failure the vault records as Overfitting in Quantitative Trading, Data-Snooping Bias and the Replication Crisis in Quantitative Finance; the backtest-to-live gap is their observable consequence at the level of fund performance.

The aggregate live evidence corroborates the mechanism. Buczynski, Cuzzolin and Sahakian’s review of 27 ML market-forecasting experiments (2021) found a literature claiming forecasting accuracy routinely above 90% but “conspicuously lacking in high-profile success cases” — while high-profile failures (Aidyia, Sentient Technologies) liquidated within a year or two of launch. The Eurekahedge AI Hedge Fund Index of ML-driven funds returned 9.8% annualised from December 2009 to July 2024 versus 13.7% for the S&P 500, and underperformed the S&P 500 and MSCI World cumulatively over 2011–2020. If the academic accuracy claims translated to live trading, ML funds would dominate; they do not. The review attributes the divergence largely to “cherry-picking” — running an average of 70.7 model configurations and reporting only the best — and to the omission of trading costs and realistic execution. This is the Sim-to-Real Gap (the reinforcement-learning-specific instance) generalised across the whole quantitative literature, and it is why the vault collects a dedicated Live Trading Evidence note: live, disclosed, costed track records are the missing top-tier evidence for every model class here.
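The cherry-picking effect described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a reconstruction of the review's method: every configuration is pure noise with zero true edge, returns are independent Gaussian draws, 71 configurations stand in for the reported average of 70.7, and all function names and parameter values are hypothetical.

```python
import random
import statistics


def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio of per-period returns (zero risk-free rate)."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)
    return (mean / stdev) * periods_per_year ** 0.5


def best_of_n_trial(rng, n_configs=71, n_days=126, daily_vol=0.01):
    """Select the configuration with the best in-sample Sharpe ratio.

    Returns (in-sample Sharpe, out-of-sample Sharpe) of the selected one.
    Every configuration has zero expected return by construction.
    """
    best_in, best_out = float("-inf"), 0.0
    for _ in range(n_configs):
        in_sample = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        out_sample = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        s_in = sharpe(in_sample)
        if s_in > best_in:
            best_in, best_out = s_in, sharpe(out_sample)
    return best_in, best_out


def simulate(n_trials=100, seed=0):
    """Average the selected in-sample and out-of-sample Sharpe over trials."""
    rng = random.Random(seed)
    results = [best_of_n_trial(rng) for _ in range(n_trials)]
    mean_in = statistics.fmean([r[0] for r in results])
    mean_out = statistics.fmean([r[1] for r in results])
    return mean_in, mean_out


if __name__ == "__main__":
    mean_in, mean_out = simulate()
    print(f"mean best in-sample Sharpe:     {mean_in:.2f}")
    print(f"mean out-of-sample Sharpe:      {mean_out:.2f}")
```

Under these assumptions the reported (best-of-71) in-sample Sharpe ratio is large while the same configuration's out-of-sample Sharpe ratio averages close to zero. Because the draws are independent, the sketch cannot reproduce a negative out-of-sample expectation; that case depends on serial dependence in returns, which is not modelled here.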

For grading purposes this concept sets the vault’s most important prior. A backtest — even an honest out-of-sample one with a benchmark — is evidence about a historical draw; the documented gap means the expected live return is materially lower, and under overfitting can be negative. Therefore no Markov-model paper or backtest in this vault can be graded strong (the tier reserved for out-of-sample profitability with realistic costs, robustness tests and independent replication or live confirmation) on backtest evidence alone, and the absence of any credible live Markov track record caps the standalone-profitability verdict at inconclusive-to-negative rather than merely “unproven”. The gap is the reason “no live evidence” is treated here not as a neutral data gap but as a substantive, theory-backed negative.
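The eligibility rule for the strong tier can be written down as a simple predicate. This is one possible reading of the tier definition above, sketched for illustration; the `Evidence` fields and the function name are hypothetical and are not part of any existing vault tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence checklist for one paper or backtest."""
    out_of_sample_profitable: bool
    realistic_costs: bool
    robustness_tests: bool
    independent_replication: bool
    live_confirmation: bool


def can_be_graded_strong(e: Evidence) -> bool:
    """Assumed reading: all core criteria, plus replication OR live confirmation."""
    core = (
        e.out_of_sample_profitable
        and e.realistic_costs
        and e.robustness_tests
    )
    return core and (e.independent_replication or e.live_confirmation)


if __name__ == "__main__":
    # An honest, costed, robust backtest with no replication and no live record.
    backtest_only = Evidence(True, True, True, False, False)
    print(can_be_graded_strong(backtest_only))
```

Under this reading, a backtest that satisfies every in-house criterion still fails the check until independent replication or a live track record is added.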

McLean and Pontiff 2016 [supports] Backtest-to-Live Performance Gap
The 10 Reasons Most Machine Learning Funds Fail [supports] Backtest-to-Live Performance Gap
Overfitting in Quantitative Trading [causes] Backtest-to-Live Performance Gap
Data-Snooping Bias [causes] Backtest-to-Live Performance Gap
Transaction Costs and Slippage [causes] Backtest-to-Live Performance Gap
Sim-to-Real Gap [part-of] Backtest-to-Live Performance Gap
Backtest-to-Live Performance Gap [contradicts] Out-of-Sample Backtesting
Backtest-to-Live Performance Gap [relates] Replication Crisis in Quantitative Finance

Connections

Sources