Sun Wang An 2021

“Reinforcement Learning for Quantitative Trading” by Shuo Sun, Rundong Wang and Bo An (Nanyang Technological University, Singapore) is a comprehensive survey first posted as arXiv:2109.13851 in 2021 and subsequently published, peer-reviewed, in ACM Transactions on Intelligent Systems and Technology 14(3), Article 44 (2023), DOI 10.1145/3582560. It shortlists more than 100 high-quality RL-for-quantitative-trading (QT) papers collected via Google Scholar and the top AI venues (NeurIPS, ICML, IJCAI, AAAI, KDD), and is, by the authors’ own statement, the first work to give the field “an in-depth taxonomy”, “analyze current challenges”, and “propose future directions”. Alongside Millea 2021 it is the vault’s evidence on what the RL-for-trading literature looks like as a whole.

Sun Wang An 2021 [defines] Reinforcement Learning Trading Policy
Sun Wang An 2021 [supports] Markov Decision Process Trading Model

The survey’s core contribution is a two-axis taxonomy. On the application axis it distinguishes four trading task families. Algorithmic trading repeatedly buys and sells a single asset to maximise final net value, with styles ranging from position trading through swing, day and scalping to high-frequency trading. Portfolio management holds multiple assets and periodically reallocates a weight vector that sums to one. Order execution fills a given order at minimum cost, balancing the market impact of trading quickly against the price risk of trading slowly, with TWAP (time-weighted average price) as the classic baseline. Market making quotes two-sided prices and profits from the spread; its main hazard is inventory accumulation. On the algorithm axis the survey sorts methods into value-based (Q-learning, DQN), policy-based (Moody & Saffell’s recurrent RL and modern policy gradients), and actor-critic / deep-RL methods (DDPG, A2C, PPO, hierarchical RL). It traces the whole field to Moody and Saffell 2001, the first RL application to algorithmic trading, which optimised profit directly and introduced the Differential Sharpe Ratio as its objective.
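To make the Differential Sharpe Ratio concrete, here is a minimal Python sketch of its standard recursive form: exponential moving estimates A and B of the first and second moments of returns, and a per-step quantity D_t = (B_{t-1}·ΔA_t − ½·A_{t-1}·ΔB_t) / (B_{t-1} − A_{t-1}²)^{3/2}. The function name, the zero initialisation, the default adaptation rate `eta`, and the guard for near-zero variance are illustrative assumptions, not details taken from the survey.

```python
def differential_sharpe_ratios(returns, eta=0.01):
    """Return the differential Sharpe ratio D_t for each return R_t.

    a and b are exponential moving estimates of E[R] and E[R^2];
    eta is the adaptation rate (hypothetical default).
    """
    a, b = 0.0, 0.0  # assumed zero initialisation
    out = []
    for r in returns:
        delta_a = r - a          # surprise in the first moment
        delta_b = r * r - b      # surprise in the second moment
        variance = b - a * a     # variance estimate before seeing r
        if variance > 1e-12:
            d = (b * delta_a - 0.5 * a * delta_b) / variance ** 1.5
        else:
            d = 0.0  # guard: D_t is undefined until some variance is observed
        out.append(d)
        # update the moving estimates only after D_t has been computed
        a += eta * delta_a
        b += eta * delta_b
    return out


if __name__ == "__main__":
    print(differential_sharpe_ratios([0.01, -0.02, 0.015, 0.03]))
```

Because D_t depends only on the previous estimates and the newest return, it can serve as a per-step reward, which is what lets a risk-adjusted objective be optimised online rather than at the end of an episode.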

Sun Wang An 2021 [relates] Moody and Saffell 2001
Moody and Saffell 2001 [defines] Recurrent Reinforcement Learning Trading

The survey gives four arguments for why RL is structurally attractive for trading, and together they form the field’s standing rationale. First, RL trains an end-to-end agent that maps market state directly to a trading action. Second, it bypasses the very hard task of forecasting future prices and instead optimises profit directly, avoiding the “unignorable gap between prediction signals and profitable trading actions”. Third, task-specific frictions such as transaction cost and slippage can be folded straight into the RL objective. Fourth, RL has the potential to generalise across market conditions where rule-based momentum or mean-reversion strategies do not. Crucially, this is a statement of design appeal, not of demonstrated profitability: the survey presents it as the motivation for the research programme, not as a conclusion that the programme has succeeded.
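The point about folding frictions into the objective can be illustrated with a one-step reward function. This is a sketch under stated assumptions, not a formulation from the survey: positions are fractions of capital, and both `cost_rate` and `slippage_rate` are hypothetical proportional charges on the amount traded.

```python
def step_reward(prev_position, new_position, price_return,
                cost_rate=0.001, slippage_rate=0.0005):
    """Net one-step reward for holding new_position over one period.

    Positions are fractions of capital in [-1, 1]; price_return is the
    asset's simple return over the period. Both rates are hypothetical
    proportional charges on the traded amount.
    """
    gross = new_position * price_return             # profit before frictions
    turnover = abs(new_position - prev_position)    # fraction of capital traded
    friction = (cost_rate + slippage_rate) * turnover
    return gross - friction


if __name__ == "__main__":
    # Entering a full long position costs 15 bps under the assumed rates.
    print(step_reward(0.0, 1.0, 0.01))
    # Holding an existing position incurs no friction.
    print(step_reward(1.0, 1.0, 0.01))
```

An agent trained on this reward is penalised for every position change, so excessive turnover is discouraged by the objective itself rather than by a filter applied after a price forecast.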

Reinforcement Learning Trading Policy [relates] Transaction Costs and Slippage
Sun Wang An 2021 [relates] Markov Decision Process Trading Model

Section 6, “Open issues and challenges”, is the survey’s skeptical core and the part most relevant to this vault’s profitability question. Sun, Wang and An name the field’s unsolved problems explicitly. Data scarcity: financial data is limited relative to the sample-hungry appetite of deep RL, so agents are hard to train and overfit easily; the authors suggest model-based RL that learns a world model of the market. Non-stationarity / distribution shift: “the severe distribution shift of the financial market makes RL-based methods exhibit poor generalisation ability in new market conditions”. This is the Non-Stationarity failure mode stated at survey level, and the proposed remedies are meta-RL and transfer learning. The sim-to-real gap: “learning by directly interacting with the real market is risky and impractical”, so RL-QT is trained almost entirely offline on historical data, which is exactly why the literature is dominated by backtests and has almost no disclosed live evidence (Sim-to-Real Gap). Weak interpretability: opaque policies block human-trader acceptance. The survey is also specifically critical of order-execution work: most papers test only on stock data, use unrealistically long execution windows that make the task artificially easy, and assume no market impact, so they “will fail at huge trading volume” in real institutional settings.
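The order-execution criticism can be made concrete with a toy calculation. The sketch below assumes a linear temporary impact model, in which a child order of size q pays `impact_coeff * q` per share; this model, the coefficient, and the function names are illustrative assumptions and do not come from the survey. Under that assumption a TWAP schedule's impact cost per share equals `impact_coeff * total_shares / n_slices`, so it grows with order size and shrinks as the execution window lengthens.

```python
def twap_schedule(total_shares, n_slices):
    """Split an order into n_slices equal child orders (TWAP)."""
    return [total_shares / n_slices] * n_slices


def impact_cost_per_share(total_shares, n_slices, impact_coeff=1e-6):
    """Average impact cost per share under a toy linear temporary impact.

    Each child order of size q is assumed to pay impact_coeff * q per
    share, so its total cost is impact_coeff * q * q (hypothetical model).
    """
    schedule = twap_schedule(total_shares, n_slices)
    total_cost = sum(impact_coeff * q * q for q in schedule)
    return total_cost / total_shares


if __name__ == "__main__":
    small = impact_cost_per_share(10_000, n_slices=10)
    large = impact_cost_per_share(1_000_000, n_slices=10)
    long_window = impact_cost_per_share(10_000, n_slices=100)
    no_impact = impact_cost_per_share(1_000_000, n_slices=10, impact_coeff=0.0)
    print(small, large, long_window, no_impact)
```

In this toy setting a backtest with `impact_coeff=0.0` reports zero execution cost at any size, and a tenfold longer window cuts the cost tenfold, which is the sense in which zero-impact assumptions and long windows make the task look easier than it is at institutional volume.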

Sun Wang An 2021 [supports] Non-Stationarity
Sun Wang An 2021 [supports] Sim-to-Real Gap
Non-Stationarity [causes] Sim-to-Real Gap

The survey’s verdict, read honestly, is a research map and not a profitability claim — which is why its profitability_evidence_grade is inconclusive. It catalogues influential research prototypes and reported simulation results, but those results sit on heterogeneous private datasets and the survey itself does not test or replicate any of them; it explicitly defers a tradeable-edge verdict and instead points to open problems. Together with Millea 2021 the message is convergent: two independent surveys aggregating well over a hundred papers each describe an active, fast-moving research programme whose foundational obstacles — non-stationarity, data scarcity, sim-to-real, and (per Millea) reproducibility — are unsolved. The survey-level evidence is therefore that RL trading is a promising line of inquiry, not a substantiated body of profitable systems, and any individual paper’s backtest Sharpe should be read against that backdrop as a hypothesis rather than a result.

Sun Wang An 2021 [supports] Millea 2021
Sun Wang An 2021 [relates] Overfitting in Quantitative Trading

Connections

Sources