Lalor Swishchuk 2025
“Deep Reinforcement Learning in Non-Markov Market-Making” by Luca Lalor and Anatoliy Swishchuk (University of Calgary, Department of Mathematics and Statistics), arXiv:2410.14504 (q-fin.TR), published in Risks 13(3):40, 2025 (MDPI). The paper builds a deep-reinforcement-learning Market Making agent and trains it with the Soft Actor-Critic (SAC) algorithm — an off-policy, entropy-maximising actor-critic suited to the continuous, high-dimensional state and action spaces of optimal market making. It appears in this vault for two reasons: it provides a clean, explicit Markov Decision Process Trading Model definition for trading, and it is unusually candid that the Markov assumption underlying that formulation is empirically false at the order-book level.
The paper gives the full MDP machinery — the (S, A, T, R) tuple, deterministic and stochastic policies, the state value function V^π and state-action value Q^π, the discounted-reward objective over a finite horizon, and the soft Bellman residual used to train the SAC critic. It then argues against the classical stochastic-optimal-control lineage (Almgren Chriss 2000 and Avellaneda-Stoikov 2008, the “AS model”). Three reasons are given: (1) model uncertainty — stochastic control “requires a well-defined model of the market dynamics, transition probabilities and reward structure,” unrealistic given latent variables like sentiment, order flow and microstructure noise; (2) high-dimensional state-action spaces make closed-form control intractable; (3) adaptivity — control solutions are “predefined … then applied without any further adaptation,” whereas RL “can continuously improve and adapt.”
Lalor Swishchuk 2025 [proposes_model] Markov Decision Process Trading Model
Lalor Swishchuk 2025 [tests_strategy] Market Making
Lalor Swishchuk 2025 [opposes] Avellaneda-Stoikov 2008
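The soft Bellman residual named above can be sketched in a few lines. This is a minimal illustration of the standard SAC critic loss (squared error against a bootstrapped, entropy-regularised target), not the paper's implementation. The function names, the single-sample estimate of the soft value, and the default `gamma` and `alpha` are assumptions chosen for illustration.

```python
from typing import Sequence


def soft_state_value(q_next: float, log_prob_next: float, alpha: float) -> float:
    """Single-sample soft value estimate: Q(s', a') - alpha * log pi(a'|s')."""
    return q_next - alpha * log_prob_next


def soft_bellman_residual(
    q_pred: Sequence[float],
    rewards: Sequence[float],
    q_next: Sequence[float],
    log_prob_next: Sequence[float],
    dones: Sequence[bool],
    gamma: float = 0.99,  # assumed discount factor
    alpha: float = 0.2,   # assumed entropy temperature
) -> float:
    """Mean of 0.5 * (Q(s, a) - target)^2 over a batch of transitions."""
    if not q_pred:
        raise ValueError("batch must contain at least one transition")
    total = 0.0
    for q, r, qn, lp, done in zip(q_pred, rewards, q_next, log_prob_next, dones):
        # No bootstrapping past the end of a finite-horizon episode.
        bootstrap = 0.0 if done else soft_state_value(qn, lp, alpha)
        target = r + gamma * bootstrap
        total += 0.5 * (q - target) ** 2
    return total / len(q_pred)
```

In a real SAC agent the target would come from a slowly updated copy of the critic and the next action would be sampled from the current policy. Both are collapsed into plain numbers here to keep the arithmetic visible.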
The title is the warning. “Many studies have shown that LOB dynamics often follow non-Markovian properties.” Real Limit Order Book data “often experiences jumps, i.e., points of discontinuity,” and models “that can portray a dependency in past trade transactions are superior to models with the assumption of an infinitesimal tick size seen in the arithmetic Brownian motion model.” The authors therefore drive the midprice with semi-Markov and Hawkes Process jump-diffusion dynamics, which are explicitly not memoryless and so directly contradict the Markov property an MDP nominally requires. This makes the central tension of the vault concrete: the true environment has memory that no finite MDP state can fully encode.
The experimental setup used 10⁶ simulations (1000 training episodes), an out-of-sample test of 200 unseen simulated episodes with the policy frozen, maximum inventory capped at q=5, a fixed bid-ask spread Δ=0.01, a running inventory penalty of 0.001, and a non-adverse fill probability of 20% calibrated to liquid CME futures.
Lalor Swishchuk 2025 [contradicts] Markov Decision Process Trading Model
Hawkes Process [part-of] Lalor Swishchuk 2025
Lalor Swishchuk 2025 [trades_market] Limit Order Book
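To make the memory in a Hawkes-driven process concrete, here is a minimal sketch of a univariate Hawkes process with an exponential kernel, simulated by Ogata's thinning method. It is not the paper's price model: the kernel choice, the function names, and the parameters `mu`, `alpha` and `beta` are all illustrative assumptions. The point is only that the event intensity at time t is a function of the whole history of past event times.

```python
import math
import random
from typing import List, Sequence


def intensity(t: float, events: Sequence[float], mu: float, alpha: float, beta: float) -> float:
    """lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)


def simulate_hawkes(mu: float, alpha: float, beta: float, horizon: float, seed: int = 0) -> List[float]:
    """Ogata thinning for an exponential-kernel Hawkes process on [0, horizon)."""
    if alpha >= beta:
        raise ValueError("need alpha < beta for a stationary process")
    rng = random.Random(seed)
    events: List[float] = []
    t = 0.0
    while True:
        # The intensity only decays between events, so its value just after
        # the current time bounds it from above until the next event.
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)  # candidate arrival from the bounding Poisson process
        if t >= horizon:
            return events
        # Accept the candidate with probability lambda(t) / lam_bar.
        if rng.random() * lam_bar <= intensity(t, events, mu, alpha, beta):
            events.append(t)
```

For example, `simulate_hawkes(mu=1.0, alpha=0.5, beta=1.0, horizon=50.0)` returns a clustered list of event times. Each accepted event raises the intensity by `alpha`, so the probability of the next jump depends on when earlier jumps occurred rather than on the current price alone.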
Section 4 states the backtest artefacts plainly, and this is the paper’s most valuable contribution to a skeptical reader. “This is very important as many models built using much of the standard mathematical finance theory in algorithmic and HFT have often been shown to over-inflate results.” The named inflations: (1) midprice trading — “trading in financial markets generally occurs at the bid and ask, not the midprice,” so a constant 1-tick-spread midprice simulation “could in fact be highly unrealistic”; (2) Adverse Selection / phantom gains — frameworks assuming price and order arrivals are independent generate “large phantom gains,” because market makers “as liquidity providers are often on the wrong side of the trade” — the authors add an explicit adverse-fill mechanism to correct this; (3) queue position — most HFT literature assumes orders are “automatically at the front of the queue,” ignoring price-time priority; (4) price ticks — diffusion models ignore the fixed price grid. The authors conclude results should be read “with a grain of salt as in any trading strategy back-test.”
Adverse Selection [causes] Phantom Gains in Backtests
Lalor Swishchuk 2025 [supports] Phantom Gains in Backtests
Lalor Swishchuk 2025 [lacks_live_evidence] Out-of-Sample Backtesting
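The phantom-gain effect can be illustrated with a toy calculation. This is a deliberately simplified sketch, not the paper's adverse-fill mechanism. It assumes each fill captures half the spread relative to the mid at fill time, after which the mid moves by a fixed amount either in the maker's favour (with probability `p_non_adverse`) or against it. The move size and function names are hypothetical. The half spread of 0.005 follows from the Δ=0.01 quoted above, and 0.2 is the note's non-adverse fill probability, used here only as an input to the toy model.

```python
import random


def expected_edge_per_fill(half_spread: float, adverse_move: float, p_non_adverse: float) -> float:
    """Closed-form expected PnL per fill, marked to the mid after the move."""
    return half_spread + adverse_move * (2.0 * p_non_adverse - 1.0)


def simulate_edge_per_fill(
    half_spread: float,
    adverse_move: float,
    p_non_adverse: float,
    n_fills: int = 20000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_fills):
        favourable = rng.random() < p_non_adverse
        move = adverse_move if favourable else -adverse_move
        total += half_spread + move
    return total / n_fills


# Fills independent of the next price move (p = 0.5): the maker keeps the half spread.
independent = expected_edge_per_fill(0.005, 0.01, 0.5)
# Mostly adverse fills (p = 0.2): the same quotes lose money on average.
adverse = expected_edge_per_fill(0.005, 0.01, 0.2)
```

Under these assumptions the independent case shows a positive edge of half the spread per fill, while the adverse case turns slightly negative. The gap between the two is the kind of phantom gain a backtest reports when it treats fills and price moves as independent.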
Profitability grade: inconclusive. The paper is methodologically careful and explicitly does not claim live profitability — its contribution is a more realistic simulation of the market-making MDP, not a demonstrated edge. The out-of-sample test is 200 simulated episodes, not real-data or live trading; no Sharpe ratio, return or drawdown is reported as an audited headline figure; replication code is not confirmed available; and the authors themselves recommend testing across unseen market regimes and building a “statistically valid back-test” before any deployment claim. As an evidence source the note is graded confirmed — the paper genuinely exists, is peer-reviewed in Risks, and its claims are reported accurately here — but its profitability evidence for market making is inconclusive.
Connections
- Markov Decision Process Trading Model — proposes_model / tests_strategy, 2024-2025, source: https://arxiv.org/html/2410.14504v2
- Market Making — tests_strategy, source: https://arxiv.org/html/2410.14504v2
- Limit Order Book — trades_market, source: https://arxiv.org/html/2410.14504v2
- Hawkes Process — uses_dataset (drives the simulated price process), source: https://arxiv.org/html/2410.14504v2
- Avellaneda-Stoikov 2008 — compares_benchmark (rejects stochastic control for RL), source: https://people.orie.cornell.edu/sfs33/LimitOrderBook.pdf
- Reinforcement Learning Trading Policy — proposes_model (Soft Actor-Critic), source: https://arxiv.org/html/2410.14504v2
- Adverse Selection — includes_costs (explicit adverse-fill mechanism), source: https://arxiv.org/html/2410.14504v2
- Phantom Gains in Backtests — suffers_overfitting_risk, source: https://arxiv.org/html/2410.14504v2
- Out-of-Sample Backtesting — lacks_live_evidence (200 simulated episodes only), source: https://arxiv.org/html/2410.14504v2