Recurrent Reinforcement Learning Trading

Recurrent reinforcement learning (RRL), also called direct reinforcement, is the trading approach introduced by John Moody and Matthew Saffell in which a recurrent policy maps the current market state directly to a trade position, and is trained by gradient ascent to maximise a financial performance function — profit, the Sharpe ratio, or the Differential Sharpe Ratio — net of transaction costs. It is the actor-only ancestor of modern deep RL trading: it skips an explicit price-forecasting step and optimises the trading objective end-to-end. The recurrence lets the policy condition on its own previous position, which matters because transaction costs depend on position changes. RRL was first set out empirically in Moody Wu Liao Saffell 1998 and consolidated as a journal method in Moody and Saffell 2001.
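For reference, the cost-adjusted reward and the Differential Sharpe Ratio are usually written along the following lines. This is a sketch in the notation commonly used for Moody and Saffell's formulation, assuming a unit position size and ignoring the risk-free rate. The symbols r_t (price return over period t), \delta (cost per unit of position change) and \eta (adaptation rate of the moving averages A_t and B_t) are introduced here for illustration and do not appear in the text above.

```latex
\begin{align}
R_t &= F_{t-1}\, r_t \;-\; \delta\,\lvert F_t - F_{t-1}\rvert \\
A_t &= A_{t-1} + \eta\,(R_t - A_{t-1}), \qquad
B_t = B_{t-1} + \eta\,(R_t^2 - B_{t-1}) \\
D_t &= \frac{B_{t-1}\,(R_t - A_{t-1}) \;-\; \tfrac{1}{2}\,A_{t-1}\,(R_t^2 - B_{t-1})}
            {\left(B_{t-1} - A_{t-1}^2\right)^{3/2}}
\end{align}
```

The |F_t - F_{t-1}| term is what ties the reward at time t to the previous position, which is why the policy takes F_{t-1} as an input.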

Moody and Saffell 2001 [defines] Recurrent Reinforcement Learning Trading
Recurrent Reinforcement Learning Trading [part-of] Reinforcement Learning Trading Policy

Mechanically, RRL trains a single-layer recurrent network whose output F_t in [-1, 1] (or {short, flat, long}) is the position. The parameters are updated online by real-time recurrent learning (RTRL): the chosen performance function is differentiated with respect to the weights through the recurrent dependence F_t = f(theta; F_{t-1}, prices, indicators). Moody & Saffell argued this is structurally superior to two alternatives: supervised “forecast then trade” pipelines (which optimise a forecast-error proxy, not P&L) and value-function reinforcement learning such as Q-learning (which suffers from Bellman’s curse of dimensionality and needs a value-function estimate). Their Journal of Forecasting (1998) and IEEE Transactions on Neural Networks (2001) results report RRL beating both Q-learning and MSE-trained forecasters on real data.
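The update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code (none was released): it assumes a single tanh unit fed with the last m returns, the previous position and a bias, uses the Differential Sharpe Ratio as the performance function, and carries the gradient dF_t/dtheta forward with the RTRL recursion. The function name `train_rrl` and the hyperparameters `m`, `delta`, `rho` and `eta` are hypothetical choices made for this example.

```python
import numpy as np

def train_rrl(returns, m=5, delta=0.001, rho=0.01, eta=0.01, seed=0):
    """Online RRL sketch. Returns (theta, positions, rewards).

    Assumptions for illustration: F_t = tanh(theta . [r_{t-m+1..t}, F_{t-1}, 1]),
    reward R_t = F_{t-1} * r_t - delta * |F_t - F_{t-1}|, and gradient ascent
    on the Differential Sharpe Ratio with learning rate rho.
    """
    returns = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.1, size=m + 2)  # [w_1..w_m, u, b]
    F_prev = 0.0                               # start flat
    dF_prev = np.zeros(m + 2)                  # dF_{t-1}/dtheta
    A, B = 0.0, 1e-4   # moving first/second moments of R_t; B > 0 avoids zero variance
    positions, rewards = [], []

    for t in range(m, len(returns)):
        r_t = returns[t]
        z = np.concatenate([returns[t - m + 1:t + 1], [F_prev, 1.0]])
        F = np.tanh(theta @ z)

        # RTRL recursion: total derivative of F_t through the F_{t-1} input.
        # theta[m] is the recurrent weight u.
        dF = (1.0 - F ** 2) * (z + theta[m] * dF_prev)

        # Reward net of transaction costs, and its gradient w.r.t. theta.
        s = np.sign(F - F_prev)
        R = F_prev * r_t - delta * abs(F - F_prev)
        dR = -delta * s * dF + (r_t + delta * s) * dF_prev

        # Derivative of the Differential Sharpe Ratio w.r.t. R_t,
        # evaluated with the moving averages from the previous step.
        var = max(B - A ** 2, 1e-12)
        dD_dR = (B - A * R) / var ** 1.5

        theta = theta + rho * dD_dR * dR       # gradient ascent step

        # Update the moving averages after the parameter step.
        A += eta * (R - A)
        B += eta * (R ** 2 - B)

        F_prev, dF_prev = F, dF
        positions.append(F)
        rewards.append(R)

    return theta, np.array(positions), np.array(rewards)

if __name__ == "__main__":
    # Synthetic returns only; no claim about profitability is implied.
    demo = np.random.default_rng(42).normal(0.0, 0.01, size=1000)
    theta, pos, rew = train_rrl(demo)
    print("mean net reward per step:", rew.mean())
```

Because the parameters change at every step, the carried gradient `dF_prev` is only an approximation to the true derivative under the current weights, which is the usual trade-off of online RTRL. A discrete {short, flat, long} variant would threshold `F` after the fact, at the cost of a non-differentiable output.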

Recurrent Reinforcement Learning Trading [contradicts] Buy-and-Hold Benchmark
Recurrent Reinforcement Learning Trading [relates] Transaction Costs and Slippage

On profitability the strategy must be graded carefully. The foundational evidence is genuinely out-of-sample and post-cost — the S&P 500 / T-Bill allocator over a 25-year test (1970-1994) with a 0.5% cost, and the USD/GBP intra-daily trader on 1996 data with bid/ask spreads — and on those tests RRL reportedly outperformed buy-and-hold and Q-learning (USD/GBP annualised Sharpe ~2.3). That places RRL above pure in-sample backtests. But the foundational data ends in the mid-1990s, the studies are single-group with thin samples and no reported drawdown or robustness statistics, and no code was released. The strategy is widely re-used (e.g. Carl Gold’s FX RRL work, Almahdi & Yang’s drawdown-objective portfolios, Dempster & Leemans) but those re-uses vary the objective and instruments rather than cleanly replicating the original headline returns.

Borrageiro Firoozye Barucca 2022 [contradicts] Recurrent Reinforcement Learning Trading

The most rigorous modern check is Borrageiro Firoozye Barucca 2022 (IEEE Access): a direct-RRL agent trading the major spot FX pairs over a seven-year out-of-sample window, with carefully modelled transaction and funding costs, achieves an annualised information ratio of only 0.52 and a 9.3% compound return. That is roughly a quarter of the ~2.3 Sharpe of the 1996 USD/GBP study. The reading is that RRL is a real, theoretically attractive method that can be slightly profitable net of costs over long horizons, but the dramatic risk-adjusted returns of the 1998-2001 papers do not reproduce. Net profitability evidence grade: moderate. The foundational evidence is out-of-sample and cost-aware, but robustness testing is limited and there is no independent replication at the original strength; later deep variants (FDDR, LSTM/GRU encoders) inherit the same overfitting and non-stationarity problems.
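To make the post-cost figures concrete, the sketch below shows one way an annualised information ratio could be computed from a position series. It is an illustrative simplification, not the cost model of Borrageiro Firoozye Barucca 2022: the helper names, the linear `cost_rate` on position changes, the linear `funding_rate` on the held position, and the zero benchmark are all assumptions made for this example.

```python
import numpy as np

def net_returns(positions, asset_returns, cost_rate, funding_rate):
    """Per-period return of holding positions[t-1] over period t, net of costs.

    Assumed (hypothetical) cost model: trading cost proportional to the size
    of the position change, funding cost proportional to the position held.
    """
    positions = np.asarray(positions, dtype=float)
    asset_returns = np.asarray(asset_returns, dtype=float)
    prev = np.concatenate([[0.0], positions[:-1]])   # start flat
    gross = prev * asset_returns
    trading_cost = cost_rate * np.abs(positions - prev)
    funding_cost = funding_rate * np.abs(prev)
    return gross - trading_cost - funding_cost

def annualised_information_ratio(returns, periods_per_year=252):
    """Mean over standard deviation of per-period returns, scaled by
    sqrt(periods_per_year). Assumes a zero benchmark."""
    returns = np.asarray(returns, dtype=float)
    sd = returns.std(ddof=1)
    if sd == 0:
        return float("nan")
    return float(np.sqrt(periods_per_year) * returns.mean() / sd)
```

With a zero benchmark this coincides with a Sharpe ratio that ignores the risk-free rate, which is the sense in which the 0.52 and ~2.3 figures are being compared above.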

Connections

Sources