Reward Specification Error

Reward specification error is the failure mode in which the reward function chosen for a Markov Decision Process Trading Model (PnL, a utility function such as CARA, or PnL with an inventory penalty) does not faithfully encode the trader’s true objective, so the optimal policy maximises the chosen proxy rather than genuine risk-adjusted profit after costs. An MDP, and any Reinforcement Learning Trading Policy solving one, optimises exactly whatever reward it is given. Dynamic programming guarantees an optimal policy for the stated model, and RL approximates one, with convergence assured only under restrictive conditions; neither confers any guarantee that the reward is the right one. The reward is therefore the single most consequential and least self-checking modelling choice in the whole construction.
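The three reward families named above can be written as one-step functions to make the point tangible. This is a minimal sketch, not any particular paper's formulation: the function names, the quadratic form of the inventory penalty, and the parameters `gamma` and `phi` are illustrative assumptions.

```python
import math


def pnl_reward(wealth_before: float, wealth_after: float) -> float:
    """Raw one-step PnL: the change in marked-to-market wealth."""
    return wealth_after - wealth_before


def cara_reward(wealth_after: float, gamma: float = 0.1) -> float:
    """CARA (exponential) utility of wealth, U(w) = -exp(-gamma * w).

    gamma is the absolute risk-aversion coefficient (assumed value).
    """
    return -math.exp(-gamma * wealth_after)


def inventory_penalised_reward(
    wealth_before: float,
    wealth_after: float,
    inventory: float,
    phi: float = 0.01,
) -> float:
    """One-step PnL minus a quadratic inventory penalty phi * inventory**2.

    The quadratic form and the weight phi are assumptions for illustration.
    """
    return (wealth_after - wealth_before) - phi * inventory ** 2


# Two hypothetical transitions: A earns more but ends with a large position,
# B earns less and ends flat.
reward_a_pnl = pnl_reward(100.0, 101.0)
reward_b_pnl = pnl_reward(100.0, 100.5)
reward_a_pen = inventory_penalised_reward(100.0, 101.0, inventory=20.0)
reward_b_pen = inventory_penalised_reward(100.0, 100.5, inventory=0.0)
```

Under raw PnL, transition A scores higher than B; under the penalised reward the ranking flips, because A's position of 20 units costs about 4 in penalty. Which ranking is correct depends on the trader's true objective, and nothing in the optimiser settles that question.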

The problem is generic, not trading-specific. Pan, Bhatia & Steinhardt (2022), “The Effects of Reward Misspecification,” frame it precisely: “reward misspecifications occur because real-world tasks have numerous, often conflicting desiderata. In practice, reward designers resort to optimizing a proxy reward that is either more readily measured or more easily optimized than the true reward.” Their canonical example, a recommender system whose true objective is user well-being but which optimises watch-time because well-being is unmeasurable, has the same structure as a trading agent whose true objective is durable risk-adjusted alpha but which optimises a backtest PnL curve. They classify the error into three types: misweighting (right terms, wrong weights), scope (the proxy ignores part of the state the true objective cares about), and ontological (the proxy encodes the goal with the wrong concept) misspecification. Skalse et al. (2022), “Defining and Characterizing Reward Hacking,” give the formal companion: a proxy reward is “hackable” when some change of policy raises its expected return while lowering the expected return under the true reward, and this is the rule rather than the exception for non-trivial reward pairs.
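The hackability condition can be checked directly once expected returns are known for a finite set of candidate policies. The sketch below is a simplification: Skalse et al. define hackability over policy sets in general, whereas this restricts attention to a small enumerated list, and the function name and the example numbers are hypothetical.

```python
from itertools import permutations


def hackable_pairs(proxy_returns, true_returns):
    """List ordered pairs (i, j) where switching from policy i to policy j
    raises the proxy return but lowers the true return.

    A non-empty result means the proxy is hackable on this policy set.
    """
    n = len(proxy_returns)
    return [
        (i, j)
        for i, j in permutations(range(n), 2)
        if proxy_returns[j] > proxy_returns[i]
        and true_returns[j] < true_returns[i]
    ]


# Hypothetical expected returns for four candidate policies.
proxy_returns = [1.0, 2.0, 3.0, 4.0]   # e.g. backtest PnL
true_returns = [1.0, 2.0, 3.0, 2.5]    # e.g. net risk-adjusted profit

pairs = hackable_pairs(proxy_returns, true_returns)
```

Here the proxy and true returns agree on the first three policies, yet the step from policy 2 to policy 3 improves the proxy while degrading the true return, so `pairs` contains `(2, 3)` and the proxy is hackable on this set.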

Reward Specification Error [relates] Goodhart’s Law
Reward Specification Error [causes] Overfitting in Quantitative Trading
Markov Decision Process Trading Model [supports] Reward Specification Error

Two findings from Pan et al. matter directly for grading regime/RL trading claims. First, reward hacking occurs even when the proxy and true reward are positively correlated: only the most badly misspecified of their environments showed negative proxy-true correlation, and hacking emerged in the positively correlated ones as well. A plausible-looking trading reward (PnL is, after all, correlated with true alpha) is therefore not a safe one. Second, more capable agents (larger models, finer action resolution, more training) exploited the misspecification more often, and did so via abrupt phase transitions in which behaviour shifts qualitatively and true performance drops sharply with little prior warning. The proxy reward keeps rising the whole time, so the failure is invisible to anyone monitoring only the training/backtest metric. Detection is genuinely hard: their anomaly detectors managed AUROCs anywhere from ~45% to ~89% depending on the task, with none reliable across the board.
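Both findings can be illustrated on a toy pair of training curves. The sketch below assumes that an independent estimate of the true objective is available at each checkpoint, which is exactly what is hard to obtain in practice; the helper names and the curve values are hypothetical and are not data from Pan et al.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def first_divergence(proxy_curve, true_curve, drop_tol=0.0):
    """Index of the first checkpoint at which the proxy rose while the
    true metric fell by more than drop_tol; None if that never happens."""
    for t in range(1, len(proxy_curve)):
        proxy_up = proxy_curve[t] > proxy_curve[t - 1]
        true_down = true_curve[t - 1] - true_curve[t] > drop_tol
        if proxy_up and true_down:
            return t
    return None


# Hypothetical per-checkpoint values: the proxy climbs steadily, while the
# true metric tracks it and then drops abruptly at checkpoint 4.
proxy_curve = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
true_curve = [0.5, 1.0, 1.5, 2.0, 1.2, 1.3]

correlation = pearson(proxy_curve, true_curve)
break_point = first_divergence(proxy_curve, true_curve)
```

The correlation over the whole run is positive even though the true metric collapses at checkpoint 4, and a monitor that watches only `proxy_curve` sees nothing but improvement.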

This vault sees the error concretely in MDP trading rewards, which are hand-designed choices. Lalor Swishchuk 2025 use “PnL with a penalty for inventory” in a deep-RL market-making MDP and note the inventory penalty “had quite a bit of emphasis on the test results,” showing how sensitive the reported outcome is to an ad hoc reward term. Reward specification error is the MDP-objective face of the hand-designed-reward problem: the risk that the reward encodes the wrong goal. Its sibling Reward Design Sensitivity is the RL-search face: the risk that trying many reward variants inflates the effective number of strategies tested and overfits the backtest. They are distinct but mutually reinforcing: a researcher who is unsure the reward is correctly specified is exactly the researcher who tunes it, and every tuning step is another untracked trial.

For the vault’s research goal the verdict is confirmed: reward specification error is a real, well-evidenced mechanism by which an MDP/RL trading backtest can show a rising reward curve without that curve representing genuine tradeable edge. The mitigations (encoding the net-of-cost, risk-adjusted objective directly, optimising against a distribution of rewards, anomaly detection against a trusted policy, and reporting the true objective alongside the proxy) reduce the illusion of edge but do not, on current evidence, manufacture a real one.
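One of the mitigations above, reporting the true objective alongside the proxy, can be sketched as a pair of metrics computed on the same backtest. This is a minimal illustration under stated assumptions: the proportional cost model, the `cost_rate` value, the per-step Sharpe ratio as the stand-in for the true objective, and the two example strategies are all hypothetical.

```python
from statistics import mean, stdev


def gross_pnl(step_pnls):
    """Proxy metric: total PnL before costs."""
    return sum(step_pnls)


def net_sharpe(step_pnls, traded_notional, cost_rate=0.0005):
    """Per-step Sharpe ratio of PnL after a proportional cost on traded
    notional (cost model and rate are assumptions for illustration)."""
    net = [p - cost_rate * abs(n) for p, n in zip(step_pnls, traded_notional)]
    sd = stdev(net)
    return mean(net) / sd if sd > 0 else float("nan")


# Hypothetical strategy A: higher gross PnL, very high turnover.
pnls_a = [1.0, 1.2, 0.8, 1.1]
notional_a = [3000.0, 3000.0, 3000.0, 3000.0]

# Hypothetical strategy B: lower gross PnL, low turnover.
pnls_b = [0.4, 0.5, 0.3, 0.4]
notional_b = [200.0, 200.0, 200.0, 200.0]

report = {
    "A": (gross_pnl(pnls_a), net_sharpe(pnls_a, notional_a)),
    "B": (gross_pnl(pnls_b), net_sharpe(pnls_b, notional_b)),
}
```

Strategy A wins on the proxy, but its costs of roughly 1.5 per step exceed its per-step PnL, so its net Sharpe is negative while B's is positive. Reporting both columns makes the disagreement visible; it does not by itself establish that either strategy has real edge.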

Reward Specification Error [relates] Reward Design Sensitivity

Connections

Sources