Preliminary Abstract

Traditional reinforcement learning (RL) defines agents as decision-makers that interact with their environment, observe state changes, and update future actions based on the rewards they receive. In language model reasoning, however, current training methods primarily reinforce final outputs, neglecting the intermediate decision-making process that guides solution trajectories. We propose Graph of Decisions for Intermediate Reward Modeling (GoD-IRM), a structured framework that assigns a reward signal at every branching decision in a reasoning trajectory. Unlike existing approaches such as rejection sampling and GRPO, which reinforce only correct completions, our approach models decision divergence, assigning higher rewards to paths that remain aligned with the optimal solution for longer and penalizing early deviations. By structuring the solution space as a graph of decision nodes, GoD-IRM captures critical decision points, enabling models to learn finer-grained decision-making heuristics and improving robustness on long-horizon reasoning tasks.
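
As a rough illustration of the decision-divergence idea, the sketch below shows one way per-decision rewards could be computed from a sampled reasoning path and a reference (optimal) path: decisions that keep the trajectory aligned earn positive credit, while decisions at and after the first deviation are penalized, with earlier deviations penalized more heavily. The names here (DecisionNode, divergence_rewards, the discount factor gamma) are illustrative assumptions introduced for this sketch, not part of the GoD-IRM specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DecisionNode:
    """Structural unit of the decision graph: one branching decision
    in a reasoning trajectory (illustrative, not the paper's API)."""
    choice: str                                     # the step taken at this node
    children: List["DecisionNode"] = field(default_factory=list)


def divergence_rewards(path: List[str], optimal: List[str],
                       gamma: float = 0.9) -> List[float]:
    """Assign a reward to each decision in `path` by comparing it to the
    reference `optimal` path: aligned decisions receive +1, and decisions
    at and after the first deviation receive a penalty whose magnitude is
    largest when the deviation occurs early (gamma ** step of divergence)."""
    rewards: List[float] = []
    diverged_at: Optional[int] = None
    for step, (taken, target) in enumerate(zip(path, optimal)):
        if diverged_at is None and taken == target:
            rewards.append(1.0)                     # still aligned with the optimum
        else:
            if diverged_at is None:
                diverged_at = step                  # first deviating decision
            # earlier deviations (smaller step index) are penalized more heavily
            rewards.append(-(gamma ** diverged_at))
    return rewards


if __name__ == "__main__":
    optimal = ["factor", "substitute", "simplify", "solve"]
    sampled = ["factor", "expand", "simplify", "solve"]   # deviates at step 1
    print(divergence_rewards(sampled, optimal))
    # -> [1.0, -0.9, -0.9, -0.9]
```

In this toy example, a deviation at the second decision yields a smaller penalty than one at the first decision would, which is one concrete way to realize the intuition that paths staying aligned with the optimal solution for longer should be rewarded more.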