Preliminary Abstract

Traditional training paradigms in mathematical reasoning, code generation, and natural language tasks enforce correctness via explicit reward signals, overlooking the latent structure within incorrect yet coherent outputs. We introduce Rearrangement Sampling, a framework that repurposes rejected solutions by reconstructing problem statements for which they are valid answers, enabling models to generalize beyond a single notion of correctness. Given an initial task that produces 𝑛 divergent completions, a larger model acts as a judge, in the spirit of RLAIF, to determine whether each completion can be mapped to a valid alternative problem. This process expands the problem-solution space by generating up to 𝑛 − 1 additional problems per task, each inheriting solutions from the original completion distribution together with assigned relative rewards. By leveraging incorrect yet interpretable generations, Rearrangement Sampling increases the efficiency of data reuse, broadens distributional coverage, and enables more structured exploration of reasoning pathways in learning systems.
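
As a concrete illustration of the loop the abstract describes, the following is a minimal sketch of Rearrangement Sampling, assuming three hypothetical callables: a sampler that returns 𝑛 completions, a judge mapping that proposes a reconstructed problem for a rejected completion (or declines), and a judge scorer that assigns relative rewards over the inherited solutions. All names here (Completion, ProblemSolutionPair, sample, judge_map, judge_score) are illustrative; the abstract does not fix an interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Completion:
    text: str
    is_correct: bool  # correctness under the original task's explicit reward

@dataclass
class ProblemSolutionPair:
    problem: str
    solutions: list[Completion]  # inherited from the original distribution
    rewards: list[float]         # relative rewards over the inherited solutions

def rearrangement_sampling(
    task: str,
    sample: Callable[[str, int], list[Completion]],
    judge_map: Callable[[str, Completion], Optional[str]],
    judge_score: Callable[[str, Completion], float],
    n: int = 8,
) -> list[ProblemSolutionPair]:
    """Hypothetical sketch: sample n completions, keep the original task,
    and turn rejected completions into reconstructed alternative problems."""
    completions = sample(task, n)

    # The original task keeps binary rewards from the explicit signal.
    dataset = [ProblemSolutionPair(
        task, completions,
        [1.0 if c.is_correct else 0.0 for c in completions],
    )]

    # Repurpose each rejected completion: the judge decides whether it can be
    # mapped to a valid alternative problem, yielding up to n - 1 new tasks.
    for c in completions:
        if c.is_correct:
            continue
        new_problem = judge_map(task, c)
        if new_problem is None:
            continue  # judge found no coherent alternative problem
        # The new problem inherits all original completions; relative rewards
        # rank how well each completion answers the reconstructed problem.
        rewards = [judge_score(new_problem, other) for other in completions]
        dataset.append(ProblemSolutionPair(new_problem, completions, rewards))

    return dataset
```

In use, the three callables might wrap a policy model and a stronger judge model; the relative rewards on reconstructed problems could then feed a preference-based training objective, though the abstract leaves the downstream objective open.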