Hambly Xu Yang 2023
“Recent Advances in Reinforcement Learning in Finance” is a 60-page survey by Ben Hambly (Mathematical Institute, University of Oxford), Renyuan Xu (then USC, Epstein Department of Industrial and Systems Engineering) and Huining Yang. It was posted to arXiv as preprint 2112.04553 in December 2021 (revised February 2023) and published in Mathematical Finance 33(3):437-503 in July 2023. It is the authoritative reference in this vault for one specific claim: that the Markov Decision Process is “the setting for many of the commonly used RL approaches” to financial decision-making. The survey opens by constructing the MDP — state space, action space, transition kernel, reward function, discount factor, policy, and the state-value and action-value functions linked by the Bellman equation — and only then introduces algorithms.
Hambly Xu Yang 2023 defines Markov Decision Process Trading Model
Hambly Xu Yang 2023 supports Markov Decision Process Trading Model
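For reference, the objects listed above can be written in the standard infinite-horizon discounted form. This is a textbook sketch, not a transcription of the survey: it assumes finite state and action spaces, a discount factor $\gamma \in [0,1)$, a transition kernel $P(s' \mid s, a)$ and an expected one-step reward $r(s,a)$, and the survey's own notation and horizon conventions may differ.

```latex
\begin{align}
V^{\pi}(s) &= \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right], \\
Q^{\pi}(s,a) &= r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s'), \\
V^{*}(s) &= \max_{a} Q^{*}(s,a), \\
Q^{*}(s,a) &= r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^{*}(s', a').
\end{align}
```

The last two lines are the Bellman optimality equations. A policy that picks an action maximising $Q^{*}(s,a)$ in every state is optimal under these assumptions.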
The survey’s central organising distinction is exactly the one this vault investigates: the MDP is the problem formulation, and the solution method is a separate choice. Classical stochastic control and Dynamic Programming solve the MDP exactly, in closed form or by recursion on the Bellman equation, when the transition kernel and reward are fully specified, at the cost of “heavily rel[ying] on model assumptions.” Reinforcement Learning Trading Policy methods solve the same MDP from sampled experience when the model is unknown, which the authors frame as the advantage of “mak[ing] full use of the large amount of financial data with fewer model assumptions.” The algorithms are reviewed with “a focus on value and policy based methods that do not require any model assumptions,” and deep RL extends these with neural-network function approximation for high-dimensional state spaces.
Markov Decision Process Trading Model relates Reinforcement Learning Trading Policy
Dynamic Programming supports Markov Decision Process Trading Model
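The split between formulation and solution method can be illustrated on a toy problem. The sketch below is not taken from the survey: the two-state MDP, its probabilities and rewards, the step-size schedule and all function names are hypothetical choices made for illustration. It solves one MDP twice, once by value iteration with the transition kernel known, and once by tabular Q-learning that only ever sees sampled transitions.

```python
import random

GAMMA = 0.9

# Hypothetical toy MDP: P[state][action] = list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, -1.0)]},
    1: {0: [(1.0, 0, 0.5)], 1: [(0.6, 1, 1.0), (0.4, 0, -2.0)]},
}


def value_iteration(P, gamma, tol=1e-10):
    """Model-based: iterate the Bellman optimality equation using the known kernel."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                new = sum(p * (r + gamma * max(Q[s2].values()))
                          for p, s2, r in P[s][a])
                delta = max(delta, abs(new - Q[s][a]))
                Q[s][a] = new
        if delta < tol:
            return Q


def sample(P, s, a, rng):
    """Simulator: draw one (next_state, reward) pair; the learner sees only this."""
    u, acc = rng.random(), 0.0
    for p, s2, r in P[s][a]:
        acc += p
        if u < acc:
            return s2, r
    return s2, r  # guard against floating-point rounding in the cumulative sum


def q_learning(P, gamma, steps=400_000, seed=0):
    """Model-free: tabular Q-learning driven by a uniformly random behaviour policy."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    visits = {s: {a: 0 for a in P[s]} for s in P}
    s = 0
    for _ in range(steps):
        a = rng.choice(list(P[s]))
        s2, r = sample(P, s, a, rng)
        visits[s][a] += 1
        alpha = visits[s][a] ** -0.7  # decaying step size (an assumed schedule)
        Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
        s = s2
    return Q


def greedy(Q):
    """Greedy policy: the highest-valued action in each state."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}


Q_dp = value_iteration(P, GAMMA)
Q_rl = q_learning(P, GAMMA)
print("DP policy:", greedy(Q_dp))
print("RL policy:", greedy(Q_rl))
```

In this toy setting both routes should recover the same greedy policy. The point of the comparison is that `q_learning` touches `P` only through `sample`, which stands in for market data or a simulator, whereas `value_iteration` reads the probabilities directly.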
The applications section casts six finance domains as MDPs: Optimal Execution (timing and sizing trades to minimise price impact and timing risk, with Almgren Chriss 2000 as the classical reference), portfolio optimisation (multi-period allocation), option pricing and hedging, Market Making (inventory control as a sequential decision), smart order routing, and robo-advising. In every case the MDP is presented as the natural scaffolding for a forward-looking decision rather than a one-shot optimisation.
Hambly Xu Yang 2023 relates Optimal Execution
Hambly Xu Yang 2023 relates Market Making
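To make the "forward-looking decision" framing concrete, here is a minimal sketch of optimal execution as a finite-horizon MDP. It is a deliberately stripped-down toy, not the survey's or Almgren and Chriss's model: the state is (time step, shares remaining), the action is the size of the next child order, and the only cost is a hypothetical quadratic temporary-impact term `ETA * a**2`, with no price risk and no permanent impact. All names and parameter values are assumptions.

```python
from functools import lru_cache

ETA = 0.01   # hypothetical temporary-impact coefficient
TOTAL = 12   # shares to liquidate
STEPS = 4    # number of trading periods


@lru_cache(maxsize=None)
def cost_to_go(t, q):
    """Backward induction: (minimum remaining cost, best trade now) at step t with q shares left."""
    if t == STEPS - 1:
        # Terminal constraint: whatever is left must be sold in the last period.
        return ETA * q * q, q
    best = None
    for a in range(q + 1):  # action: sell a shares now
        future, _ = cost_to_go(t + 1, q - a)
        total = ETA * a * a + future
        if best is None or total < best[0]:
            best = (total, a)
    return best


def optimal_schedule():
    """Roll the optimal policy forward from the initial state (0, TOTAL)."""
    q, plan = TOTAL, []
    for t in range(STEPS):
        _, a = cost_to_go(t, q)
        plan.append(a)
        q -= a
    return plan


plan = optimal_schedule()
total_cost = cost_to_go(0, TOTAL)[0]
print(plan, total_cost)
```

With these inputs the schedule is an even split, `[3, 3, 3, 3]`, because a convex per-trade cost penalises concentrating volume in any one period. Adding a risk term or a price process to the state would change the schedule, and that is where the sequential formulation stops being equivalent to a one-shot split.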
On profitability the honest grade is inconclusive, and this matters for how the paper is cited. This is a survey, not an empirical study: it reports no asset returns, no Sharpe ratios, no drawdowns, no backtest, and no out-of-sample validation. It is frequently invoked as if “trading is an MDP” were an endorsement of MDP- or RL-based trading — it is not. The survey itself enumerates the obstacles to industrial adoption: the difficulty of correctly specifying the MDP (state representation and reward design), the lack of robustness and generalisability of learned policies, the cost of exploration in live markets, and the simulation-to-deployment gap. It explicitly notes that markets are non-stationary and only partially observable, which strains the Markov assumption underpinning the entire construction. The correct reading is that Hambly-Xu-Yang establishes a rigorous, unifying vocabulary — it confers no edge and makes no profit claim.
Non-Stationarity opposes Markov Decision Process Trading Model
Partial Observability opposes Markov Decision Process Trading Model
Connections
- Markov Decision Process Trading Model — proposes_model, 2023, source: https://arxiv.org/abs/2112.04553
- Reinforcement Learning Trading Policy — relates, 2023, source: https://arxiv.org/abs/2112.04553
- Dynamic Programming — relates, 2023, source: https://arxiv.org/abs/2112.04553
- Optimal Execution — optimises_policy, 2023, source: https://arxiv.org/abs/2112.04553
- Market Making — optimises_policy, 2023, source: https://arxiv.org/abs/2112.04553
- Almgren Chriss 2000 — relates, 2023, source: https://arxiv.org/abs/2112.04553
- Non-Stationarity — suffers_overfitting_risk, 2023, source: https://arxiv.org/abs/2112.04553
- Partial Observability — suffers_overfitting_risk, 2023, source: https://arxiv.org/abs/2112.04553