Reward Design Sensitivity
Reward design sensitivity is the failure mode in which a Reinforcement Learning Trading Policy’s measured performance changes substantially with the choice of reward function and its associated hyperparameters. Because the Markov Decision Process Trading Model underneath an RL trading agent only ever optimises whatever reward it is given, the reward is the single most consequential modelling choice — and in practice it is hand-picked by trial and error. This note is the RL-search face of the problem; its sibling Reward Specification Error is the MDP-objective face — the risk that the chosen reward does not faithfully encode the trader’s true objective. Reward design sensitivity is specifically about how fragile and tunable reported RL edges are because the reward itself is a degree of freedom.
The space of plausible rewards is large and the choices are not equivalent. Raw profit, the Sharpe ratio, the Differential Sharpe Ratio, drawdown- or downside-penalised return, inventory-penalised PnL, and composite multi-term rewards all encode different objectives and produce different policies. Srivastava, Aryan & Singh (2025) make the search structure explicit: they propose a composite reward combining four weighted components (annualised return, downside risk, differential return, Treynor ratio) and tune the four weights w1, ..., w4 by grid search over the simplex, so reward design becomes an extra hyperparameter search layered on top of model training. They are candid that single-metric rewards “can encourage reward hacking or over-optimization of one aspect of trading” and that different weight settings “shift the agent toward more conservative or aggressive trading behaviors.” That sensitivity is exactly the failure mode: the same market data and the same model class yield materially different reported behaviour depending on the reward chosen.
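A minimal sketch of what a grid search over the weight simplex implies for the number of reward variants. The function names, the grid resolution, the sign convention (downside risk subtracted, the other terms added) and the two toy policies are assumptions made for illustration; they are not taken from Srivastava, Aryan & Singh (2025).

```python
from itertools import product


def simplex_grid(n_components=4, steps=4):
    """Yield every weight vector with entries k/steps that sums to 1."""
    for ks in product(range(steps + 1), repeat=n_components):
        if sum(ks) == steps:
            yield tuple(k / steps for k in ks)


def composite_reward(ann_return, downside_risk, diff_return, treynor, weights):
    """Weighted four-term reward.

    Assumed sign convention: downside risk is a penalty, so it is
    subtracted; the other three components are added.
    """
    w1, w2, w3, w4 = weights
    return w1 * ann_return - w2 * downside_risk + w3 * diff_return + w4 * treynor


# Hypothetical component values for two toy policies (not real results):
# (annualised return, downside risk, differential return, Treynor ratio).
aggressive = (0.30, 0.25, 0.05, 0.10)
conservative = (0.10, 0.05, 0.02, 0.08)

grid = list(simplex_grid(n_components=4, steps=4))
prefers_aggressive = sum(
    composite_reward(*aggressive, w) > composite_reward(*conservative, w)
    for w in grid
)

print(f"reward variants on the grid: {len(grid)}")
print(f"variants that rank the aggressive policy first: {prefers_aggressive}")
```

Even this coarse grid (weights in steps of 0.25) yields 35 distinct rewards, and the preferred policy flips depending on which weight vector is chosen. Each grid point is a separate reward function, so each is also a separate trial when counting how many strategies were tested.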
Reward Design Sensitivity [relates] Reward Specification Error
Reward Design Sensitivity [relates] Markov Decision Process Trading Model
The hyperparameter dimension compounds it. Gort et al. 2022 note that deep RL algorithms are “highly sensitive to hyperparameters,” producing high variability across trials, so a researcher can get “lucky” and report an over-optimistic agent. Zhang, Zohren & Roberts (2019) likewise tabulate distinct hyperparameter sets per RL algorithm. Learning rate, discount factor, network architecture, the cost-penalty weight and the random seed all move results, and the published number is typically the best run rather than the distribution. When the reward and hyperparameters are tuned to maximise the figure that is then reported, every variant tried is another strategy tested — the effective number of strategies explodes, and with it the probability that the best-looking agent is a Data-Snooping Bias artefact rather than a genuine edge.
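The "best run rather than the distribution" effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of any cited experiment: every simulated agent has zero true edge (i.i.d. Gaussian daily returns with mean 0), and the daily volatility, number of days, number of repeats and seed are arbitrary illustrative choices.

```python
import math
import random


def annualised_sharpe(returns, periods_per_year=252):
    """Sample Sharpe ratio of per-period returns, annualised."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def best_of_n_sharpe(n_trials, n_days, rng):
    """Best Sharpe among n_trials agents whose true edge is zero."""
    return max(
        annualised_sharpe([rng.gauss(0.0, 0.01) for _ in range(n_days)])
        for _ in range(n_trials)
    )


def mean_best_sharpe(n_trials, n_days=252, n_repeats=50, seed=0):
    """Average the 'reported' (best-of-n) Sharpe over repeated studies."""
    rng = random.Random(seed)
    bests = [best_of_n_sharpe(n_trials, n_days, rng) for _ in range(n_repeats)]
    return sum(bests) / n_repeats


results = {n: mean_best_sharpe(n) for n in (1, 10, 100)}
for n, sharpe in results.items():
    print(f"{n:>3} trials: mean best Sharpe = {sharpe:.2f}")
```

With a single trial the average reported Sharpe sits near zero, as it should for an agent with no edge. As the number of reward, hyperparameter and seed variants grows, the best-of-n figure rises even though nothing about the underlying agents has improved.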
Gort et al. 2022 [supports] Reward Design Sensitivity
Reward Design Sensitivity [causes] Overfitting in Quantitative Trading
Reward Design Sensitivity [causes] Data-Snooping Bias
The risk is well evidenced and its cost is demonstrated. Gort et al.’s own brutal experiment shows the endpoint: even after applying a combinatorial-cross-validation overfitting test to reject over-tuned agents, the best surviving DRL crypto agent still lost ~35% in an out-of-sample crash window — overfitting control improved relative robustness but produced no absolute profitability. For this vault’s research goal the verdict is confirmed: reward design sensitivity is a real mechanical reason RL trading backtests are fragile and hard to reproduce. The mitigations — fixing the reward before evaluation, reporting seed-level variance, deflating the Sharpe ratio for the number of reward/hyperparameter trials — reduce the illusion of edge but do not, on the current evidence, reveal a robust tradeable one.
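One way to deflate a Sharpe ratio for the number of trials is the Deflated Sharpe Ratio of Bailey and López de Prado (2014), which this note does not cite; the sketch below assumes that formulation. The function names and example inputs are hypothetical. Sharpe ratios here are per-period (not annualised), `sharpe_variance` is the variance of the Sharpe ratios across all trials, and `kurtosis` is the raw (non-excess) kurtosis, so 3.0 corresponds to Gaussian returns.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected best Sharpe among n_trials strategies with zero true edge."""
    if n_trials < 2:
        return 0.0  # a single trial carries no selection bias
    q1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    q2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)


def deflated_sharpe_ratio(sharpe, n_obs, n_trials, sharpe_variance,
                          skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds the selection-bias benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2)
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)


# Hypothetical inputs: daily Sharpe 0.1 over 252 days, and a variance of
# Sharpe ratios across trials of 0.004.
for n_trials in (1, 10, 100):
    dsr = deflated_sharpe_ratio(0.1, n_obs=252, n_trials=n_trials,
                                sharpe_variance=0.004)
    print(f"{n_trials:>3} trials: deflated Sharpe = {dsr:.3f}")
```

The same observed Sharpe ratio looks convincing when it came from a single pre-committed configuration and unconvincing once it is known to be the best of a hundred reward and hyperparameter variants. The count passed as `n_trials` should include every variant tried, not only those reported.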
Reward Design Sensitivity [relates] Differential Sharpe Ratio
Reward Design Sensitivity [relates] Goodhart’s Law
Connections
- Reward Specification Error — relates, the MDP-objective face of the same hand-designed-reward problem, source: https://arxiv.org/html/2410.14504v2
- Goodhart’s Law — relates, tuning the reward to maximise a reported metric Goodharts that metric, source: https://arxiv.org/abs/2201.03544
- Reinforcement Learning Trading Policy — suffers_overfitting_risk, source: https://arxiv.org/abs/2209.05559
- Markov Decision Process Trading Model — relates, the MDP optimises whatever reward it is given, source: https://arxiv.org/abs/2209.05559
- Gort et al. 2022 — supports, DRL agents “highly sensitive to hyperparameters”, lucky agents look over-optimistic, source: https://arxiv.org/abs/2209.05559
- Overfitting in Quantitative Trading — suffers_overfitting_risk, reward variants inflate the effective strategy count, source: https://arxiv.org/abs/2209.05559
- Data-Snooping Bias — suffers_overfitting_risk, tuning the reward on the test set is data snooping, source: https://arxiv.org/abs/2209.05559
- Differential Sharpe Ratio — relates, one of the candidate reward functions whose choice changes results, source: https://arxiv.org/html/2506.04358v1
Sources
- Gort et al. (2022/2023), Deep Reinforcement Learning for Cryptocurrency Trading, arXiv:2209.05559
- Srivastava, Aryan & Singh (2025), Risk-Aware Reinforcement Learning Reward for Financial Trading, arXiv:2506.04358
- Zhang, Zohren & Roberts (2019), Deep Reinforcement Learning for Trading, arXiv:1911.10107