Recurrent Reinforcement Learning Trading
Recurrent reinforcement learning (RRL), also called direct reinforcement, is the trading approach introduced by John Moody and Matthew Saffell in which a recurrent policy maps the current market state directly to a trade position, and is trained by gradient ascent to maximise a financial performance function — profit, the Sharpe ratio, or the Differential Sharpe Ratio — net of transaction costs. It is the actor-only ancestor of modern deep RL trading: it skips an explicit price-forecasting step and optimises the trading objective end-to-end. The recurrence lets the policy condition on its own previous position, which matters because transaction costs depend on position changes. RRL was first set out empirically in Moody Wu Liao Saffell 1998 and consolidated as a journal method in Moody and Saffell 2001.
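The Differential Sharpe Ratio named above can be illustrated with a short sketch. It assumes the form usually attributed to Moody and Saffell, in which exponential moving averages A and B of the returns and squared returns are updated with an adaptation rate eta, and the per-step reward is D_t = (B_{t-1} dA_t - 0.5 A_{t-1} dB_t) / (B_{t-1} - A_{t-1}^2)^(3/2). The function name, the default eta, the zero initialisation, and the small variance guard are illustrative assumptions, not details taken from this entry.

```python
def differential_sharpe(returns, eta=0.01):
    """Per-step Differential Sharpe Ratio of a return series (illustrative sketch)."""
    a = 0.0  # moving average of returns, A_{t-1}
    b = 0.0  # moving average of squared returns, B_{t-1}
    out = []
    for r in returns:
        da = r - a        # Delta A_t = R_t - A_{t-1}
        db = r * r - b    # Delta B_t = R_t^2 - B_{t-1}
        var = b - a * a   # running variance estimate B_{t-1} - A_{t-1}^2
        # Report 0 until the variance estimate is safely positive (assumed guard).
        d = (b * da - 0.5 * a * db) / var ** 1.5 if var > 1e-12 else 0.0
        out.append(d)
        a += eta * da     # update the moving averages after scoring the step
        b += eta * db
    return out
```

In an RRL setting the series passed in would be the trader's own net-of-cost returns, and D_t would act as the per-step reward whose gradient with respect to the policy weights is followed.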
Moody and Saffell 2001 [defines] Recurrent Reinforcement Learning Trading
Recurrent Reinforcement Learning Trading [part-of] Reinforcement Learning Trading Policy
Mechanically, RRL trains a single-layer recurrent network whose output F_t in [-1, 1] (or {short, flat, long}) is the position; the parameters are updated online by real-time recurrent learning (RTRL), differentiating the chosen performance function with respect to the weights through the recurrent dependence F_t = f(theta; F_{t-1}, prices, indicators). Moody & Saffell argued this is structurally superior to two alternatives: supervised “forecast then trade” pipelines (which optimise a forecast-error proxy, not P&L) and value-function reinforcement learning such as Q-learning (which suffers Bellman’s curse of dimensionality and needs a value-function estimate). Their Journal of Forecasting and IEEE results report RRL beating both Q-learning and MSE-trained forecasters on real data.
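The mechanism described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the policy is a single tanh unit, the inputs are the last m returns, the objective is plain additive profit with a proportional cost delta (rather than the Sharpe or Differential Sharpe Ratio), and the trading return is taken as R_t = F_{t-1} r_t - delta |F_t - F_{t-1}|. The function name `rrl_run` and all parameter defaults are hypothetical.

```python
import math

def rrl_run(theta, returns, m=3, delta=0.001, lr=0.0):
    """One pass of a toy RRL trader (illustrative sketch).

    theta = [w_1..w_m, u, b]; position F_t = tanh(w.x_t + u*F_{t-1} + b).
    Returns (theta, profit, grad). With lr=0 theta stays fixed and grad is the
    gradient of total profit; with lr>0 theta is updated online at every step.
    """
    theta = list(theta)                      # work on a copy
    n = m + 2
    F_prev, dF_prev = 0.0, [0.0] * n         # start flat, with zero sensitivity
    profit, grad = 0.0, [0.0] * n
    for t in range(m - 1, len(returns)):
        # Inputs: the m most recent returns, the previous position, and a bias term.
        phi = list(returns[t - m + 1:t + 1]) + [F_prev, 1.0]
        F = math.tanh(sum(w * x for w, x in zip(theta, phi)))
        # RTRL recursion: total derivative of F_t, including the path through F_{t-1}.
        dF = [(1.0 - F * F) * (phi[i] + theta[m] * dF_prev[i]) for i in range(n)]
        # The previous position earns r_t; changing position costs delta per unit.
        R = F_prev * returns[t] - delta * abs(F - F_prev)
        s = (F > F_prev) - (F < F_prev)      # sign(F_t - F_{t-1})
        dR = [returns[t] * dF_prev[i] - delta * s * (dF[i] - dF_prev[i])
              for i in range(n)]
        profit += R
        for i in range(n):
            grad[i] += dR[i]
            theta[i] += lr * dR[i]           # online gradient-ascent step
        F_prev, dF_prev = F, dF
    return theta, profit, grad
```

Calling `rrl_run(theta0, returns, lr=0.05)` performs online training over one pass of a return series, while `lr=0.0` evaluates a fixed policy. Because the cost term depends on F_t - F_{t-1}, the gradient has to carry the previous position's sensitivity forward, which is the part a feed-forward policy would miss.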
Recurrent Reinforcement Learning Trading [contradicts] Buy-and-Hold Benchmark
Recurrent Reinforcement Learning Trading [relates] Transaction Costs and Slippage
On profitability the strategy must be graded carefully. The foundational evidence is genuinely out-of-sample and post-cost: an S&P 500 / T-Bill allocator tested over 25 years (1970-1994) with a 0.5% cost, and a USD/GBP intra-daily trader on 1996 data with bid/ask spreads. On those tests RRL reportedly outperformed buy-and-hold and Q-learning (USD/GBP annualised Sharpe ~2.3), which places it above pure in-sample backtests. But the foundational data ends in the mid-1990s, the studies come from a single research group, rely on thin samples, and report no drawdown or robustness statistics, and no code was released. The strategy is widely re-used (e.g. Carl Gold’s FX RRL work, Almahdi & Yang’s drawdown-objective portfolios, Dempster & Leemans), but those re-uses vary the objective and instruments rather than cleanly replicating the original headline returns.
Borrageiro Firoozye Barucca 2022 [contradicts] Recurrent Reinforcement Learning Trading
The most rigorous modern check is Borrageiro Firoozye Barucca 2022 (IEEE Access): a direct-RRL agent trading the major spot FX pairs over a seven-year out-of-sample window, with carefully modelled transaction and funding costs, achieves an annualised information ratio of only 0.52 and a 9.3% compound return. That ratio is roughly a quarter of the ~2.3 Sharpe of the 1996 USD/GBP study (0.52 / 2.3 ≈ 0.23). The reading is that RRL is a real, theoretically attractive method that can be slightly profitable net of costs over long horizons, but the dramatic risk-adjusted returns of the 1998-2001 papers do not reproduce. Overall profitability evidence grade: moderate. The foundational evidence is out-of-sample and cost-aware, but robustness testing is limited and there is no independent replication at the original strength; later deep variants (FDDR, LSTM/GRU encoders) inherit the same overfitting and Non-Stationarity problems.
Connections
- Moody and Saffell 2001 — proposes_model, 2001, source: https://ieeexplore.ieee.org/document/935097/
- Moody Wu Liao Saffell 1998 — proposes_model, 1998, source: https://link.springer.com/chapter/10.1007/978-1-4615-5625-1_10
- John Moody — proposes_model, source: https://ieeexplore.ieee.org/document/935097/
- Matthew Saffell — proposes_model, source: https://ieeexplore.ieee.org/document/935097/
- Reinforcement Learning Trading Policy — optimises_policy, source: https://ieeexplore.ieee.org/document/935097/
- Differential Sharpe Ratio — optimises_policy (used as the reward objective), source: https://proceedings.neurips.cc/paper_files/paper/1998/hash/4e6cd95227cb0c280e99a195be5f6615-Abstract.html
- S&P 500 — trades_market / reports_profitability, 1970-1994, source: https://proceedings.neurips.cc/paper_files/paper/1998/hash/4e6cd95227cb0c280e99a195be5f6615-Abstract.html
- Transaction Costs and Slippage — includes_costs, source: https://ieeexplore.ieee.org/document/935097/
- Buy-and-Hold Benchmark — compares_benchmark, source: https://proceedings.neurips.cc/paper_files/paper/1998/hash/4e6cd95227cb0c280e99a195be5f6615-Abstract.html
- Borrageiro Firoozye Barucca 2022 — reports_underperformance / replication_missing, 2022, source: https://arxiv.org/abs/2110.04745
Sources
- Learning to trade via direct reinforcement (Moody & Saffell, IEEE Trans. Neural Networks 2001)
- Reinforcement Learning for Trading (Moody & Saffell, NIPS 1998)
- Reinforcement Learning for Trading Systems and Portfolios (Springer)
- Reinforcement Learning for Systematic FX Trading (Borrageiro et al., IEEE Access 2022)