Reward Design Sensitivity

Reward design sensitivity is the failure mode in which a Reinforcement Learning Trading Policy’s measured performance changes substantially with the choice of reward function and its associated hyperparameters. Because the Markov Decision Process Trading Model underneath an RL trading agent only ever optimises whatever reward it is given, the reward is the single most consequential modelling choice, and in practice it is hand-picked by trial and error. This note covers the RL-search face of the problem. Its sibling, Reward Specification Error, covers the MDP-objective face: the risk that the chosen reward does not faithfully encode the trader’s true objective. Reward design sensitivity is specifically about how fragile and tunable reported RL edges are because the reward itself is a degree of freedom.

The space of plausible rewards is large and the choices are not equivalent. Raw profit, the Sharpe ratio, the Differential Sharpe Ratio, drawdown- or downside-penalised return, inventory-penalised PnL, and composite multi-term rewards all encode different objectives and produce different policies. Srivastava, Aryan & Singh (2025) make the search structure explicit: they propose a composite reward combining four weighted components (annualised return, downside risk, differential return, Treynor ratio) and tune the four weights w1..w4 by grid search over the simplex, which layers reward design as an extra hyperparameter search on top of model training. They are candid that single-metric rewards “can encourage reward hacking or over-optimization of one aspect of trading” and that different weight settings “shift the agent toward more conservative or aggressive trading behaviors.” That sensitivity is exactly the failure mode: the same market data and the same model class yield materially different reported behaviour depending on the reward chosen.
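
To make the size of that search concrete, here is a minimal sketch of a grid over the four-weight simplex and a weighted-sum reward. It is an illustration under assumptions, not the authors' implementation: the function names, the 0.1 grid resolution, the sign convention (every component is pre-signed so that higher is better) and the two episodes' component values are all hypothetical.

```python
from itertools import product


def simplex_grid(n_weights, steps):
    """Yield every weight vector with entries k/steps that sums to 1."""
    for combo in product(range(steps + 1), repeat=n_weights - 1):
        remainder = steps - sum(combo)
        if remainder >= 0:
            yield tuple(k / steps for k in combo) + (remainder / steps,)


def composite_reward(components, weights):
    """Weighted sum of reward components.

    Assumption: components are already normalised and signed so that
    higher is better (downside risk enters as a negative number).
    """
    return sum(w * c for w, c in zip(weights, components))


# Every grid point is a distinct candidate reward function.
grid = list(simplex_grid(n_weights=4, steps=10))
print(f"candidate reward functions: {len(grid)}")

# Hypothetical component values, ordered as
# (annualised return, -downside risk, differential return, Treynor ratio).
episode_a = (0.30, -0.25, 0.02, 0.10)  # high return, high downside risk
episode_b = (0.10, -0.05, 0.01, 0.08)  # low return, low downside risk

aggressive = (0.7, 0.1, 0.1, 0.1)    # weight mostly on return
conservative = (0.1, 0.7, 0.1, 0.1)  # weight mostly on downside risk

for name, w in (("aggressive", aggressive), ("conservative", conservative)):
    score_a = composite_reward(episode_a, w)
    score_b = composite_reward(episode_b, w)
    preferred = "A" if score_a > score_b else "B"
    print(f"{name}: A={score_a:.3f} B={score_b:.3f} prefers {preferred}")
```

The two weight vectors rank the same pair of episodes in opposite order, which is the sensitivity in miniature: the weights decide which behaviour the agent is trained to prefer, and every grid point evaluated is one more trial to account for.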

Reward Design Sensitivity [relates] Reward Specification Error
Reward Design Sensitivity [relates] Markov Decision Process Trading Model

The hyperparameter dimension compounds it. Gort et al. 2022 note that deep RL algorithms are “highly sensitive to hyperparameters,” producing high variability across trials, so a researcher can get “lucky” and report an over-optimistic agent. Zhang, Zohren & Roberts (2019) likewise tabulate distinct hyperparameter sets per RL algorithm. Learning rate, discount factor, network architecture, the cost-penalty weight and the random seed all move results, and the published number is typically the best run rather than the distribution. When the reward and hyperparameters are tuned to maximise the figure that is then reported, every variant tried is another strategy tested — the effective number of strategies explodes, and with it the probability that the best-looking agent is a Data-Snooping Bias artefact rather than a genuine edge.
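
The gap between the best run and the distribution can be sketched with a pure-noise simulation. The setup is assumed for illustration only: each "agent" is a year of zero-mean Gaussian daily returns, so no trial has any real edge, and the trial count (5 rewards x 8 hyperparameter sets x 5 seeds) is hypothetical.

```python
import random
import statistics


def annualised_sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio of a return series (risk-free rate taken as 0)."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)
    return (mean / stdev) * periods_per_year ** 0.5


def simulate_null_trials(n_trials, n_days=252, daily_vol=0.01, seed=0):
    """Sharpe ratios of n_trials zero-edge strategies (pure Gaussian noise)."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_trials):
        returns = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        sharpes.append(annualised_sharpe(returns))
    return sharpes


# Hypothetical search: 5 rewards x 8 hyperparameter sets x 5 seeds.
n_trials = 5 * 8 * 5
sharpes = simulate_null_trials(n_trials)

best = max(sharpes)
median = statistics.median(sharpes)
print(f"trials={n_trials} best={best:.2f} median={median:.2f}")
```

Reporting only `best` from such a search describes the selection process more than the agent. The median and spread across trials are the numbers that indicate whether anything was learned.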

Gort et al. 2022 [supports] Reward Design Sensitivity
Reward Design Sensitivity [causes] Overfitting in Quantitative Trading
Reward Design Sensitivity [causes] Data-Snooping Bias

The risk is well evidenced and its cost is demonstrated. Gort et al.’s own experiment shows the endpoint: even after applying a combinatorial cross-validation overfitting test to reject over-tuned agents, the best surviving DRL crypto agent still lost ~35% in an out-of-sample crash window. Overfitting control improved relative robustness but produced no absolute profitability. For this vault’s research goal the verdict is confirmed: reward design sensitivity is a real mechanical reason RL trading backtests are fragile and hard to reproduce. The mitigations (fixing the reward before evaluation, reporting seed-level variance, deflating the Sharpe ratio for the number of reward/hyperparameter trials) reduce the illusion of edge but do not, on the current evidence, reveal a robust tradeable one.
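
The deflation step can be sketched with the Deflated Sharpe Ratio of Bailey and López de Prado (2014), which this note does not otherwise cite and which is brought in here as one possible way to do the adjustment. The code below is a minimal sketch of that formula as commonly stated. The function names and all input numbers are hypothetical, the Sharpe ratio is per period (not annualised), and `sharpe_variance` is the variance of Sharpe ratios across the trials that were run.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_variance):
    """Approximate expected best Sharpe among n_trials zero-edge strategies."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    q1 = _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
    q2 = _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    return sqrt(sharpe_variance) * ((1 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)


def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_variance,
                    skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds the best-of-N noise benchmark.

    sharpe is the per-period Sharpe of the selected strategy, n_obs is the
    number of return observations, and kurtosis is non-excess (3 for a
    normal distribution).
    """
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    denom = sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    z = (sharpe - benchmark) * sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)


# Hypothetical inputs: a daily Sharpe of 0.10 over 504 days, with the
# Sharpe ratios of the tried variants having a standard deviation of 0.05.
dsr_few = deflated_sharpe(0.10, n_obs=504, n_trials=2, sharpe_variance=0.05 ** 2)
dsr_many = deflated_sharpe(0.10, n_obs=504, n_trials=200, sharpe_variance=0.05 ** 2)
print(f"2 trials: {dsr_few:.3f}   200 trials: {dsr_many:.3f}")
```

The same observed Sharpe ratio looks convincing when it is the better of two trials and unconvincing when it is the best of two hundred, which is why the trial count has to include every reward and hyperparameter variant that was actually tried.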

Reward Design Sensitivity [relates] Differential Sharpe Ratio
Reward Design Sensitivity [relates] Goodhart’s Law

Connections

Sources