Preliminary Abstract

Standard decoding in large language models (LLMs) typically samples multiple candidate completions per query, but each candidate is generated independently, with no mechanism for iterative self-improvement. We propose a method that integrates environment feedback into the generation process, conditioning each subsequent candidate n + 1 on the feedback received by candidate n, so that the model produces a progressively better-informed sequence of outputs. In code generation, this involves executing generated programs, analyzing their runtime traces, and refining subsequent completions to systematically reduce errors. In reinforcement learning-based post-training, the method replaces static ranking of candidates with verifiable-reward-driven refinement, allowing models to adjust iteratively based on structured feedback rather than predefined preferences. By explicitly modeling iterative self-correction, the approach enables progressive error minimization, improves adaptive reasoning, and enhances task-specific generalization. Planned experiments will evaluate its impact on code synthesis and decision-making tasks, measuring gains in functional correctness and reward optimization.
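
To make the feedback-conditioned generation loop concrete, the following is a minimal sketch of the code-generation instantiation described above. The names `generate` and `run_candidate` are hypothetical stand-ins, not part of the proposed system: `generate` wraps any LLM completion call, and `run_candidate` is assumed to execute a program in a sandbox and return a pass/fail flag together with a runtime trace.

```python
# Sketch: candidate n+1 is conditioned on the execution feedback of candidate n.
# `generate` and `run_candidate` are assumed interfaces supplied by the caller.
from typing import Callable, Tuple


def iterative_refinement(
    task: str,
    generate: Callable[[str], str],
    run_candidate: Callable[[str], Tuple[bool, str]],
    max_candidates: int = 4,
) -> str:
    """Generate up to `max_candidates` programs for `task`, folding each
    candidate's runtime trace back into the prompt for the next attempt."""
    prompt = task
    best = ""
    for n in range(max_candidates):
        candidate = generate(prompt)              # candidate n
        passed, trace = run_candidate(candidate)  # environment feedback
        best = candidate
        if passed:
            break                                 # functional correctness reached
        # Condition candidate n+1 on the feedback received by candidate n.
        prompt = (
            f"{task}\n\n# Previous attempt (candidate {n}):\n{candidate}\n"
            f"# Execution feedback:\n{trace}\n"
            "# Revise the program to fix the errors above."
        )
    return best
```

Conditioning on the full runtime trace, rather than only a pass/fail signal, is what allows each revision to target the specific failure observed in the previous candidate.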