Zhang Zohren Roberts 2019

“Deep Reinforcement Learning for Trading” by Zihao Zhang, Stefan Zohren and Stephen Roberts of the Oxford-Man Institute of Quantitative Finance, University of Oxford, first appeared as arXiv:1911.10107 in November 2019 and was published in The Journal of Financial Data Science 2(2):25-40 in March 2020 (DOI 10.3905/jfds.2020.1.030). It is the strongest positive evidence cited anywhere in this vault for reinforcement-learning trading: an out-of-sample futures backtest that beats classical momentum baselines on risk-adjusted return and stays profitable after a realistic transaction-cost rate. Its central methodological move is to train agents that output a trade position directly — bypassing the usual two-step “predict the return, then map the prediction to a position” pipeline — by formulating trading as a Markov Decision Process and letting a reinforcement-learning policy optimise expected cumulative reward.

The experimental design is more rigorous than most RL-trading papers. The data are 50 ratio-adjusted continuous futures contracts from the Pinnacle Data Corp CLC Database — 25 commodity, 11 equity-index, 5 fixed-income and 9 FX contracts — spanning 2005-2019. Models are retrained on an expanding window every five years, with parameters frozen for the following five-year block to produce genuinely out-of-sample results over a 2011-2019 test window. The state vector combines past price, volatility-normalised returns over 1-month to 1-year horizons, and MACD/RSI indicators over a 60-step lookback; the reward includes a volatility-scaling term (positions scale up in calm regimes, down in turbulent ones) and an explicit transaction-cost term, so frictions enter the objective the agent actually optimises rather than being bolted on afterward. Three algorithms are compared: Deep Q-Network (critic-only, discrete actions), Policy Gradient (actor-only) and Advantage Actor-Critic (continuous actions), all built on two-layer LSTM networks.
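
The reward described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: it assumes an additive-profit reward in which the previous position is scaled by a target-to-realised volatility ratio, and the cost is a flat rate charged on the change in scaled position. The function name `vol_scaled_reward`, its parameter names, the 10% volatility target and the sample numbers are all hypothetical.

```python
def vol_scaled_reward(a_prev, a_prev2, price_change, price_prev,
                      sigma_prev, sigma_prev2,
                      sigma_tgt=0.10, cost_rate=0.0025, mu=1.0):
    """One-step reward for the position held from t-1 to t (illustrative).

    a_prev, a_prev2          positions in [-1, 1] chosen at t-1 and t-2
    price_change             additive price change p_t - p_{t-1}
    price_prev               price p_{t-1}, used to size the cost term
    sigma_prev, sigma_prev2  ex-ante volatility estimates at t-1 and t-2
    sigma_tgt                volatility target (assumed value)
    cost_rate                flat cost rate; 0.0025 corresponds to 25 basis points
    mu                       fixed contract-count multiplier
    """
    # Volatility scaling: larger exposure in calm regimes, smaller in turbulent ones.
    scaled_prev = a_prev * sigma_tgt / sigma_prev
    scaled_prev2 = a_prev2 * sigma_tgt / sigma_prev2

    # Profit from holding the scaled position over the step.
    pnl = scaled_prev * price_change

    # Cost is proportional to the size of the position change, so it is
    # part of the objective the agent optimises.
    cost = cost_rate * price_prev * abs(scaled_prev - scaled_prev2)

    return mu * (pnl - cost)
```

With the same unit long position and the same price move, a realised volatility of 5% doubles the exposure relative to the target while 20% halves it, and a flip from short to long pays the cost on the full two-unit change.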

Zhang Zohren Roberts 2019 [proposes_model] Markov Decision Process Trading Model
Zhang Zohren Roberts 2019 [tests_strategy] Reinforcement Learning Trading Policy
Zhang Zohren Roberts 2019 [trades_market] Futures Markets

The headline result is a clear risk-adjusted win for the RL agents. On the all-contracts portfolio with portfolio-level volatility targeting, annualised Sharpe ratios were DQN 1.288, A2C 1.050 and PG 0.754, against 0.441 for Sign(R) time-series momentum, 0.091 for MACD and 0.058 for a long-only buy-and-hold. DQN also posted the best Sortino (2.220) and Calmar (1.025) ratios. A transaction-cost robustness check — plotting portfolio Sharpe against a rising cost rate — shows DQN and A2C still generating positive profit at a 25 basis-point cost rate, roughly $3.5 per contract, which the authors describe as realistic for a retail trader (institutions typically pay less). Per-contract boxplots confirm the result is not driven by a single outlier instrument. This is a properly costed, properly out-of-sample backtest that beats sensible benchmarks — which is why it earns a moderate rather than a weak grade.
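
For readers who want to check figures of this kind against their own return series, the three ratios can be computed as below. This is a sketch under common conventions that are assumptions here, not details confirmed by the paper: a zero risk-free rate, 252 trading periods per year, additive (non-compounded) cumulation of returns, and a downside deviation measured against a zero target. The function name `performance_ratios` is hypothetical.

```python
import math


def performance_ratios(returns, periods_per_year=252):
    """Annualised Sharpe, Sortino and Calmar ratios from per-period returns."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two returns")

    mean = sum(returns) / n
    ann_return = mean * periods_per_year

    # Sharpe denominator: annualised sample standard deviation.
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    ann_vol = math.sqrt(variance * periods_per_year)

    # Sortino denominator: annualised root-mean-square of negative returns.
    downside_sq = sum(min(r, 0.0) ** 2 for r in returns) / n
    ann_downside = math.sqrt(downside_sq * periods_per_year)

    # Calmar denominator: largest peak-to-trough fall of the cumulative curve.
    cumulative = peak = max_drawdown = 0.0
    for r in returns:
        cumulative += r
        peak = max(peak, cumulative)
        max_drawdown = max(max_drawdown, peak - cumulative)

    def ratio(numerator, denominator):
        # Undefined when there is no risk of the relevant kind in the sample.
        return numerator / denominator if denominator > 0 else float("nan")

    return {
        "sharpe": ratio(ann_return, ann_vol),
        "sortino": ratio(ann_return, ann_downside),
        "calmar": ratio(ann_return, max_drawdown),
    }
```

Because Sortino penalises only downside moves and Calmar only the single worst drawdown, a strategy can rank differently on each; reporting all three, as the paper does for DQN, is more informative than Sharpe alone.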

But several gaps keep it well short of strong, and the paper itself supplies the most important caveat. The edge is regime-dependent: on the equity-index sub-portfolio a simple long-only strategy beat the RL agents, because the 2011-2019 test window was dominated by a sustained equity bull market and the agents’ ability to go short or flat was a liability there. The RL advantage concentrated in commodities and FX, where two-sided positioning has room to add value. This is direct evidence that the result is conditional on the asset mix and the macro regime of one historical window, not a universal alpha. Beyond regime dependence: there is no live or paper-trading evidence; slippage and market impact are not modelled (the cost term is a flat per-trade rate, the assumption Transaction Costs and Slippage flags as structurally optimistic for any non-trivial size); no maximum-drawdown table is published; no replication package is released, so independent verification is not possible; and, critically for this vault’s research question, the paper applies no multiple-testing control. It does not disclose how many network architectures, lookback windows, reward parameterisations or hyperparameter settings were searched, and reports no Probability of Backtest Overfitting or Deflated Sharpe Ratio. As Overfitting in Quantitative Trading establishes, an undisclosed configuration search is a standing downgrade signal: a Sharpe of 1.288 with an unknown trial count cannot be distinguished from a selection-bias artefact on the evidence given.
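
Why the trial count matters can be made concrete with the expected-maximum-Sharpe approximation that underlies the Deflated Sharpe Ratio: the best Sharpe expected from N independent configurations whose true Sharpe is zero. The sketch below is an illustration only. The paper discloses neither the number of trials nor the spread of Sharpe ratios across them, so both inputs are hypothetical, and the function name `expected_max_sharpe` is an assumption of this note.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected best Sharpe among n_trials skill-less strategies.

    n_trials    number of independent configurations tried (hypothetical input)
    sharpe_std  standard deviation of Sharpe estimates across those trials,
                in the same units as the Sharpe being judged (hypothetical input)
    """
    if n_trials < 2:
        raise ValueError("the approximation needs at least two trials")
    z = NormalDist().inv_cdf  # standard normal quantile function
    return sharpe_std * (
        (1.0 - EULER_GAMMA) * z(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * z(1.0 - 1.0 / (n_trials * math.e))
    )
```

The benchmark rises with the number of trials, so a reported Sharpe only counts as evidence to the extent that it exceeds the value this function returns for the search actually performed. Without a disclosed trial count that comparison cannot be made.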

Zhang Zohren Roberts 2019 [reports_profitability] Reinforcement Learning Trading Policy
Zhang Zohren Roberts 2019 [includes_costs] Transaction Costs and Slippage
Buy-and-Hold Benchmark [contradicts] Zhang Zohren Roberts 2019
Sim-to-Real Gap [opposes] Zhang Zohren Roberts 2019

The net verdict is a moderate profitability grade — the highest any positive RL-trading paper reaches in this vault, but capped firmly there. The study clears the out-of-sample, transaction-cost and benchmark bars, yet fails the robustness and reproducibility bars required for strong: it is a single academic backtest on one futures universe over one regime window, with no live track record, no slippage model, no drawdown reporting, no replication code, and no overfitting diagnostic. It demonstrates that a Markov-Decision-Process framing of trading can produce a costed out-of-sample edge in a controlled study — it does not demonstrate that the edge is real, persistent, or capturable by an actual trader. Read alongside Gort et al. 2022, the pair frames the vault’s RL evidence honestly: one carefully costed positive backtest, one paper showing how easily such backtests are overfitted false positives.

Connections

Sources