Out-of-Sample Backtesting
Out-of-sample (OOS) backtesting evaluates a trading model on a segment of data that played no part in fitting or tuning it — via a held-out test period, a train/test split, or rolling validation. It is the primary defence against Overfitting in Quantitative Trading: a model with a genuine edge should retain it on unseen data, whereas an overfitted one decays. The simplest form, a single train/test split, is also the weakest: it tests only one held-out window, and once that window’s results are inspected and the model revised, the OOS period has been “consumed” — every revision after seeing it converts it back into an in-sample set. This is the precise mechanism of Data-Snooping Bias, and it is why the vault treats a one-shot hold-out as only a partial control.
Walk-forward analysis (rolling-origin validation) improves on the static split by fitting the model on a training window, testing on the next window, then rolling both forward and repeating, so the strategy must “prove itself repeatedly across multiple out-of-sample test periods.” Marcos López de Prado nonetheless identifies three structural flaws in walk-forward backtesting: (1) it tests only a single historical path, “which can be easily overfit”; (2) it is biased by the particular sequence of observations — fitting on the reversed series often yields a contradictory result, and “the fact that changing the sequence of observations yields inconsistent outcomes is evidence of that overfitting”; (3) its early decisions rest on a small slice of data, so a few observations carry disproportionate weight and the resulting Sharpe estimate has high variance. High variance is itself a source of false discovery: a researcher selecting the best of many walk-forward backtests will pick an inflated Sharpe even when the true Sharpe is zero.
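The rolling scheme described above can be sketched as an index generator. This is a minimal illustration, not a reference implementation: the function name `walk_forward_splits` and its parameters are hypothetical, the windows are fixed-length, and the step equals the test-window size (an assumption; anchored or expanding training windows are equally common).

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_obs: int, train_size: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs for a rolling walk-forward scheme.

    Each test window starts where its training window ends, and both
    windows advance by test_size observations per step.
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_obs:
        train_end = start + train_size
        train_idx = list(range(start, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        start += test_size  # roll both windows forward


splits = list(walk_forward_splits(n_obs=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)
```

Every training index precedes every test index in the same split, which is what keeps each test window out-of-sample. The splits still trace a single historical path, so the sketch inherits the three flaws listed above.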
Standard k-fold cross-validation cannot rescue OOS evaluation in finance, because it assumes independent, identically distributed observations. Financial features are serially correlated and labels are formed on overlapping horizons, so placing adjacent points in different folds leaks information: a classifier can appear skilful “even if X is an irrelevant feature.” López de Prado’s purged cross-validation removes this leakage with two operations — purging (dropping training observations whose label-formation window overlaps a test label) and embargoing (dropping a small buffer of training observations, typically ~1-5%, immediately after each test fold to block leakage from market-reaction lag and autocorrelation). His Combinatorial Purged Cross-Validation (CPCV) goes further: it partitions the series into N groups, takes all C(N,k) train/test combinations, and reconstructs φ[N,k]=(k/N)·C(N,k) distinct backtest paths, yielding a whole distribution of Sharpe ratios instead of one likely-overfit point estimate. A synthetic-controlled comparison by Arian, Seco and López de Prado found CPCV produced the lowest Probability of Backtest Overfitting and the highest Deflated Sharpe Ratio statistic, while walk-forward showed “increased temporal variability and weaker stationarity.”
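Purging, embargoing and the CPCV path count can be sketched on integer bar indices. This is a simplified illustration under stated assumptions, not López de Prado's implementation: the names `purged_kfold_splits` and `cpcv_path_count` are hypothetical, every label is assumed to span a fixed window from bar i to bar i + horizon, and the embargo is given as a count of observations rather than a fraction of the sample.

```python
from math import comb
from typing import List, Tuple


def purged_kfold_splits(
    n_obs: int, n_folds: int, horizon: int, embargo: int
) -> List[Tuple[List[int], List[int]]]:
    """Return (train_idx, test_idx) pairs with purging and an embargo.

    Assumption: the label of observation i uses data from bar i through
    bar i + horizon, so its label window is [i, i + horizon].
    """
    fold_size = n_obs // n_folds
    splits = []
    for fold in range(n_folds):
        test_start = fold * fold_size
        # The last fold absorbs any remainder observations.
        test_end = n_obs if fold == n_folds - 1 else test_start + fold_size
        test_idx = list(range(test_start, test_end))
        # Span covered by the test labels, from first start to last end.
        test_span_start = test_start
        test_span_end = test_end - 1 + horizon
        train_idx = []
        for i in range(n_obs):
            if test_start <= i < test_end:
                continue  # observation belongs to the test fold
            # Purging: drop if [i, i + horizon] overlaps the test label span.
            overlaps = i <= test_span_end and i + horizon >= test_span_start
            # Embargo: drop a further buffer right after the test label span.
            embargoed = test_span_end < i <= test_span_end + embargo
            if not overlaps and not embargoed:
                train_idx.append(i)
        splits.append((train_idx, test_idx))
    return splits


def cpcv_path_count(n_groups: int, k_test: int) -> Tuple[int, int]:
    """Return (number of train/test combinations, number of backtest paths)."""
    n_combinations = comb(n_groups, k_test)
    # phi[N, k] = (k / N) * C(N, k); the product is always divisible by N.
    n_paths = k_test * n_combinations // n_groups
    return n_combinations, n_paths


splits = purged_kfold_splits(n_obs=20, n_folds=4, horizon=2, embargo=1)
train_idx, test_idx = splits[1]
print("test :", test_idx)
print("train:", train_idx)
print(cpcv_path_count(n_groups=6, k_test=2))
```

With 20 observations, a two-bar label horizon and a one-bar embargo, the second fold tests on bars 5 to 9 and trains only on bars 0 to 2 and 13 to 19: bars 3, 4, 10 and 11 are purged and bar 12 is embargoed. For N = 6 groups with k = 2 test groups, the formula gives 15 combinations and 5 backtest paths.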
The deepest limit is that no OOS test fully escapes Non-Stationarity. A historical hold-out is still drawn from the past; if the data-generating process shifts, as markets demonstrably do, even an honest OOS result is evidence about a regime that may not recur. This is acute for Markov models, whose first-order transition matrices are themselves estimated on a window and are unstable across regimes (Non-Stationary Transition Matrix, HMM Parameter Instability). OOS testing also does not, by itself, control for the number of strategies tried: the authors of Pseudo-Mathematics and Financial Charlatanism show that high in-sample Sharpe ratios are achievable after only a handful of trials, derive a Minimum Backtest Length that must grow with the number of configurations, and prove the most damaging result for this vault: under memory effects in financial series, an overfit backtest produces negative, not merely zero, expected OOS returns. Backtest overfitting “is a particularly expensive form of selection bias.”
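The effect of an uncontrolled search can be illustrated with a small simulation. It is a sketch under loud assumptions: every "strategy" is independent Gaussian noise with zero true Sharpe, the sample is 252 periods long, and the function names are hypothetical. It does not model memory effects, so it shows only the in-sample inflation and not the negative expected OOS return.

```python
import random
import statistics
from typing import List


def sharpe_ratio(returns: List[float]) -> float:
    """Per-period Sharpe ratio (mean over sample standard deviation), not annualised."""
    return statistics.mean(returns) / statistics.stdev(returns)


def simulate_trial_sharpes(n_trials: int, n_periods: int, seed: int) -> List[float]:
    """Sharpe ratios of n_trials strategies whose returns are pure noise."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_trials):
        # Zero-mean returns: the true Sharpe ratio of every trial is zero.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_periods)]
        sharpes.append(sharpe_ratio(returns))
    return sharpes


sharpes = simulate_trial_sharpes(n_trials=200, n_periods=252, seed=7)
for n_tried in (1, 10, 50, 200):
    # Reporting only the best trial is the selection step that inflates the result.
    best = max(sharpes[:n_tried])
    print(f"best of {n_tried:>3} trials: per-period Sharpe = {best:.3f}")
```

Because each larger set of trials contains the smaller ones, the best Sharpe can only stay the same or rise as more configurations are tried, even though no trial has any edge. That is why the number of trials has to be reported alongside the result.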
For grading purposes, this concept is the evidential backbone of the vault. A purely in-sample Markov-model backtest is weak evidence of profitability by default — the lowest grade absent other defects. A genuine OOS test, especially walk-forward across multiple regimes with realistic Transaction Costs and Slippage and a benchmark, can lift a result to moderate. The strong grade additionally demands robustness tests and independent replication. The single most diagnostic question — per Pseudo-Mathematics and Financial Charlatanism — is how many configurations were tried: “a backtest where the researcher has not controlled for the extent of the search involved in his or her finding is worthless, regardless of how excellent the reported performance might be.” Reusing a published OOS dataset across many later studies silently rebuilds that uncontrolled search, which is why the vault privileges true OOS design, multiple-testing-aware statistics (Deflated Sharpe Ratio), and independent replication together.
- Out-of-Sample Backtesting [opposes] Overfitting in Quantitative Trading
- Out-of-Sample Backtesting [contradicts] Data-Snooping Bias
- Combinatorial Purged Cross-Validation [supports] Out-of-Sample Backtesting
- Marcos López de Prado [defines] Combinatorial Purged Cross-Validation
- Non-Stationarity [opposes] Out-of-Sample Backtesting
- Out-of-Sample Backtesting [relates] Deflated Sharpe Ratio
Connections
- Overfitting in Quantitative Trading — contradicts, source: https://ssrn.com/abstract=2308659
- Data-Snooping Bias — contradicts, source: https://ssrn.com/abstract=2460551
- Combinatorial Purged Cross-Validation — replication_available, source: https://en.wikipedia.org/wiki/Purged_cross-validation
- Deflated Sharpe Ratio — relates, source: https://ssrn.com/abstract=2460551
- Pseudo-Mathematics and Financial Charlatanism — compares_benchmark, source: https://ssrn.com/abstract=2308659
- Non-Stationarity — contradicts, source: https://www.garp.org/hubfs/Whitepapers/a1Z1W0000054x6lUAA.pdf
- Transaction Costs and Slippage — relates, source: https://ssrn.com/abstract=2308659
- Regime Classification — supports, source: https://en.wikipedia.org/wiki/Purged_cross-validation
Sources
- Bailey, D.H., Borwein, J.M., López de Prado, M. & Zhu, Q.J. (2014). “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance.” Notices of the American Mathematical Society, 61(5), 458-471. https://ssrn.com/abstract=2308659
- López de Prado, M. (2018). “The 10 Reasons Most Machine Learning Funds Fail.” The Journal of Portfolio Management, 44(6), 120-133. https://www.garp.org/hubfs/Whitepapers/a1Z1W0000054x6lUAA.pdf
- Bailey, D.H. & López de Prado, M. (2014). “The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality.” Journal of Portfolio Management, 40(5), 94-107. https://ssrn.com/abstract=2460551
- “Purged cross-validation.” Wikipedia (summarising López de Prado, Advances in Financial Machine Learning, 2018). https://en.wikipedia.org/wiki/Purged_cross-validation