Backtest-to-Live Performance Gap
The backtest-to-live performance gap is the systematic, repeatedly measured deterioration of a trading strategy’s returns between its published or simulated backtest and its performance once deployed with real capital. It is not an occasional accident but a structural regularity of quantitative finance: the gap is large, it is documented across asset-pricing anomalies, machine-learning funds and AI-driven hedge funds, and its causes are understood. For this vault the gap is decisive, because every Markov-model profitability claim gathered so far is a backtest or simulation. If a credible live Markov-model track record existed, it would dominate any backtest as evidence; its complete absence, set against a well-documented gap, is itself strong negative evidence on the standalone-profitability question.
The single most direct measurement comes from McLean and Pontiff’s “Does Academic Research Destroy Stock Return Predictability?” (Journal of Finance, 2016), summarised in the McLean and Pontiff 2016 note. Replicating 97 cross-sectional return predictors published in peer-reviewed journals, they find hedge-portfolio returns are 26% lower out-of-sample (after the original sample ends but before publication) and 58% lower post-publication than in-sample. The 32-percentage-point difference between those two figures (58% − 26%) is attributed to investors learning the anomaly and trading it away; the 26% out-of-sample decline is an upper-bound estimate of statistical bias, exposed once the lucky sample window ends. Decay is largest three to four years after publication and worst for predictors with higher in-sample returns and purely technical (price/volume) signals, which is precisely the profile of many Markov and regime-switching trading rules. Their conclusion is blunt: an investor should expect to capture well under half the gross return a published study reports, and less again after trading frictions. Subsequent meta-studies (Chen and Zimmermann 2022, and the same authors’ “Publication Bias in Asset Pricing Research”) confirm the post-publication decay while debating how much is data mining versus genuine mispricing being arbitraged. Either way the forward-looking lesson is the same: published numbers overstate live returns.
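The decay arithmetic above can be made concrete with a small sketch. It applies the two reported decay fractions (26% and 58%) to a backtested return and splits the loss into the two components the paragraph describes. The function names and the 10% example return are hypothetical, chosen only for illustration, and the sketch treats the average decay as if it applied uniformly to any one strategy, which is a simplifying assumption rather than something the paper claims.

```python
# Decay fractions as quoted in the text from McLean and Pontiff (2016).
OOS_DECAY = 0.26       # returns lower out-of-sample, before publication
POST_PUB_DECAY = 0.58  # returns lower after publication


def haircut_return(in_sample_return: float, decay: float) -> float:
    """Scale an in-sample return down by a fractional decay in [0, 1]."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be between 0 and 1")
    return in_sample_return * (1.0 - decay)


def decay_breakdown(in_sample_return: float) -> dict:
    """Split the backtest-to-live shortfall into its two reported parts.

    Assumption: the average decay applies uniformly to this one strategy.
    """
    oos = haircut_return(in_sample_return, OOS_DECAY)
    post = haircut_return(in_sample_return, POST_PUB_DECAY)
    return {
        "in_sample": in_sample_return,
        "out_of_sample": oos,
        "post_publication": post,
        # Shortfall visible as soon as the original sample window ends.
        "lost_to_statistical_bias": in_sample_return - oos,
        # Further shortfall once the predictor is published and traded.
        "lost_to_informed_trading": oos - post,
    }


if __name__ == "__main__":
    # Hypothetical strategy with a 10% in-sample annual return.
    for name, value in decay_breakdown(0.10).items():
        print(f"{name}: {value:.2%}")
```

For the hypothetical 10% in-sample return, the post-publication figure is 4.2%, consistent with the statement that an investor keeps well under half of the published gross return before costs.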
The mechanism is not unique to factor anomalies. Marcos López de Prado’s “The 10 Reasons Most Machine Learning Funds Fail” (Journal of Portfolio Management, 2018), summarised in the note The 10 Reasons Most Machine Learning Funds Fail, describes the gap from inside the industry: “research through backtesting” (iterating an ML model until a nice backtest appears), reliance on a single walk-forward path, and undisclosed trial counts all manufacture an in-sample result that does not survive contact with a live market. His earlier “Pseudo-Mathematics and Financial Charlatanism” (co-authored with Bailey, Borwein and Zhu) proves the sharpest version: under the memory effects real markets exhibit, an overfit backtest produces negative, not merely zero, expected out-of-sample returns, so the gap can flip a strategy’s sign. This is the same failure the vault records as Overfitting in Quantitative Trading, Data-Snooping Bias and the Replication Crisis in Quantitative Finance; the backtest-to-live gap is their observable consequence at the level of fund performance.
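A minimal simulation can illustrate how “research through backtesting” manufactures an in-sample result. The sketch below generates many strategy configurations with no edge at all, keeps the one with the best in-sample Sharpe ratio, and then scores that same configuration on fresh data. Everything here is an assumption for illustration: the function names, independent Gaussian daily returns, a 1% daily volatility, one year per sample, and 71 trials. Because the simulated returns are independent, the selected strategy's expected out-of-sample Sharpe ratio is zero; the negative expectation described above relies on memory effects that this sketch deliberately leaves out.

```python
import math
import random
import statistics


def sharpe(returns):
    """Annualised Sharpe ratio of daily returns (risk-free rate taken as 0)."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(252)


def best_of_n_experiment(n_trials=71, n_days=252, daily_vol=0.01, seed=0):
    """Pick the best in-sample configuration among pure-noise strategies.

    Returns (winner's in-sample Sharpe, winner's out-of-sample Sharpe,
    list of all in-sample Sharpes).
    """
    rng = random.Random(seed)
    in_sample, out_of_sample = [], []
    for _ in range(n_trials):
        # Zero-mean noise: by construction no configuration has a real edge.
        is_path = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        oos_path = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        in_sample.append(sharpe(is_path))
        out_of_sample.append(sharpe(oos_path))
    # Selection step: report only the configuration with the best backtest.
    winner = max(range(n_trials), key=lambda i: in_sample[i])
    return in_sample[winner], out_of_sample[winner], in_sample


if __name__ == "__main__":
    picked_is, picked_oos = [], []
    for seed in range(50):
        best_is, winner_oos, _ = best_of_n_experiment(seed=seed)
        picked_is.append(best_is)
        picked_oos.append(winner_oos)
    print(f"mean in-sample Sharpe of selected strategy:     {statistics.mean(picked_is):.2f}")
    print(f"mean out-of-sample Sharpe of selected strategy: {statistics.mean(picked_oos):.2f}")
```

The gap between the two averages comes entirely from the selection step, which is why an undisclosed trial count makes a reported backtest hard to interpret.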
The aggregate live evidence corroborates the mechanism. Buczynski, Cuzzolin and Sahakian’s review of 27 ML market-forecasting experiments (2021) found a literature claiming forecasting accuracy routinely above 90% but “conspicuously lacking in high-profile success cases” — while high-profile failures (Aidyia, Sentient Technologies) liquidated within a year or two of launch. The Eurekahedge AI Hedge Fund Index of ML-driven funds returned 9.8% annualised from December 2009 to July 2024 versus 13.7% for the S&P 500, and underperformed the S&P 500 and MSCI World cumulatively over 2011–2020. If the academic accuracy claims translated to live trading, ML funds would dominate; they do not. The review attributes the divergence largely to “cherry-picking” — running an average of 70.7 model configurations and reporting only the best — and to the omission of trading costs and realistic execution. This is the Sim-to-Real Gap (the reinforcement-learning-specific instance) generalised across the whole quantitative literature, and it is why the vault collects a dedicated Live Trading Evidence note: live, disclosed, costed track records are the missing top-tier evidence for every model class here.
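To give a sense of scale for the two annualised figures quoted above, the following sketch compounds them over the stated window. The function name is hypothetical, and the period length is an assumption: the window is read as end of December 2009 to end of July 2024, that is 14 years and 7 months, with returns compounded annually.

```python
def growth_of_one(annual_return: float, years: float) -> float:
    """Terminal value of 1 unit compounded at a constant annual return."""
    return (1.0 + annual_return) ** years


# Assumption: end-December 2009 to end-July 2024 = 14 years and 7 months.
YEARS = 14 + 7 / 12

ai_index = growth_of_one(0.098, YEARS)  # Eurekahedge AI Hedge Fund Index
sp500 = growth_of_one(0.137, YEARS)     # S&P 500

print(f"AI index grows 1.00 to {ai_index:.2f}")
print(f"S&P 500 grows 1.00 to {sp500:.2f}")
print(f"Index terminal wealth as a share of S&P 500: {ai_index / sp500:.0%}")
```

Under these assumptions one unit grows to roughly 3.9 in the AI index against roughly 6.5 in the S&P 500, so a gap of about four percentage points a year leaves the index investor with around 60% of the benchmark's terminal wealth.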
For grading purposes this concept sets the vault’s most important prior. A backtest — even an honest out-of-sample one with a benchmark — is evidence about a historical draw; the documented gap means the expected live return is materially lower, and under overfitting can be negative. Therefore no Markov-model paper or backtest in this vault can be graded strong (the tier reserved for out-of-sample profitability with realistic costs, robustness tests and independent replication or live confirmation) on backtest evidence alone, and the absence of any credible live Markov track record caps the standalone-profitability verdict at inconclusive-to-negative rather than merely “unproven”. The gap is the reason “no live evidence” is treated here not as a neutral data gap but as a substantive, theory-backed negative.
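The grading rule can be written down as a simple checklist, sketched below. This is one literal reading of the criteria in the paragraph above, not a definitive encoding of the vault's rubric: the class, its field names and the function are hypothetical, and only the strong tier is modelled because it is the only tier the text defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical checklist for one study; field names are illustrative."""
    oos_profitable: bool           # profitable on out-of-sample data
    realistic_costs: bool          # trading costs and execution modelled
    robustness_tests: bool         # results survive robustness checks
    independent_replication: bool  # reproduced by someone other than the authors
    live_confirmation: bool        # disclosed, costed live track record


def qualifies_as_strong(e: Evidence) -> bool:
    """Strong needs a sound backtest plus a check from outside that backtest."""
    backtest_quality = e.oos_profitable and e.realistic_costs and e.robustness_tests
    external_check = e.independent_replication or e.live_confirmation
    return backtest_quality and external_check


# An honest, costed, robust backtest with nothing beyond it is not strong.
backtest_only = Evidence(True, True, True, False, False)
print(qualifies_as_strong(backtest_only))  # False

# The same backtest with a live track record clears the bar.
with_live = Evidence(True, True, True, False, True)
print(qualifies_as_strong(with_live))  # True
```

Keeping the external check as a separate term makes the paragraph's point explicit: no amount of backtest quality can substitute for it.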
- McLean and Pontiff 2016 [supports] Backtest-to-Live Performance Gap
- The 10 Reasons Most Machine Learning Funds Fail [supports] Backtest-to-Live Performance Gap
- Overfitting in Quantitative Trading [causes] Backtest-to-Live Performance Gap
- Data-Snooping Bias [causes] Backtest-to-Live Performance Gap
- Transaction Costs and Slippage [causes] Backtest-to-Live Performance Gap
- Sim-to-Real Gap [part-of] Backtest-to-Live Performance Gap
- Backtest-to-Live Performance Gap [contradicts] Out-of-Sample Backtesting
- Backtest-to-Live Performance Gap [relates] Replication Crisis in Quantitative Finance
Connections
- McLean and Pontiff 2016 — reports_underperformance, anomaly returns 26% lower OOS / 58% lower post-publication, source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2156623
- The 10 Reasons Most Machine Learning Funds Fail — reports_underperformance, practitioner catalogue of why ML funds fail live, source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3104816
- Chen and Zimmermann 2022 — replication_available, confirms post-publication decay across 319 predictors, source: https://www.openassetpricing.com/
- Live Trading Evidence — lacks_live_evidence, no credible live Markov track record exists, source: https://pmc.ncbi.nlm.nih.gov/articles/PMC8019690/
- AI Hedge Fund Index Underperformance — reports_underperformance, Eurekahedge AI index lags S&P 500 in live data, source: https://www.ig.com/za/prime/insights/articles/has-artificial-intelligences-impact-on-hedge-funds-been-overhype-241121
- Sim-to-Real Gap — relates, the RL-specific instance of this gap, source: https://pmc.ncbi.nlm.nih.gov/articles/PMC8019690/
- Overfitting in Quantitative Trading — relates, overfitting is a primary cause of the gap, source: https://www.garp.org/hubfs/Whitepapers/a1Z1W0000054x6lUAA.pdf
- Data-Snooping Bias — relates, multiple-testing selection inflates the backtest the gap is measured against, source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2156623
- Transaction Costs and Slippage — relates, omitted costs widen the gap, source: https://pmc.ncbi.nlm.nih.gov/articles/PMC8019690/
- Replication Crisis in Quantitative Finance — relates, un-replicable backtests cannot be checked against live results, source: https://arxiv.org/abs/2209.13623
- Out-of-Sample Backtesting — contradicts, even honest OOS results decay live, source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2156623
Sources
- McLean, R. D., & Pontiff, J. (2016). “Does Academic Research Destroy Stock Return Predictability?” Journal of Finance, 71(1), 5–32. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2156623 — DOI 10.1111/jofi.12365
- López de Prado, M. (2018). “The 10 Reasons Most Machine Learning Funds Fail.” Journal of Portfolio Management, 44(6), 120–133. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3104816 — full text https://www.garp.org/hubfs/Whitepapers/a1Z1W0000054x6lUAA.pdf
- Buczynski, W., Cuzzolin, F., & Sahakian, B. (2021). “A review of machine learning experiments in equity investment decision-making: why most published research findings do not live up to their promise in real life.” International Journal of Data Science and Analytics, 11(3), 221–242. https://pmc.ncbi.nlm.nih.gov/articles/PMC8019690/
- IG Prime (2024). “Has artificial intelligence’s impact on hedge funds been overhyped?” (citing Eurekahedge AI Hedge Fund Index / Hulbert Ratings). https://www.ig.com/za/prime/insights/articles/has-artificial-intelligences-impact-on-hedge-funds-been-overhype-241121
- Chen, A. Y., & Zimmermann, T. (2022). “Open Source Cross-Sectional Asset Pricing.” Critical Finance Review, 11(2), 207–264. https://www.openassetpricing.com/
- Chen, A. Y., & Zimmermann, T. (2022). “Publication Bias in Asset Pricing Research.” arXiv:2209.13623. https://arxiv.org/abs/2209.13623