Mining or Synthesis? Rethinking Exploration Efficiency in Iterative Alignment of Mathematical Reasoning
Event: PRODUCT_LAUNCH, 2026-05-29. Impact: MEDIUM
arXiv:2602.05370v3
Announce Type: replace
Abstract: Iterative Direct Preference Optimization (DPO) has emerged as a widely used paradigm for aligning Large Language Models on reasoning tasks. Existing approaches typically rely on Best-of-N sampling ($N\geq8$) to mine positive trajectories from the distribution tail. In this work, we show that in mathematical reasoning, increasing $N$ yields d
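
For readers unfamiliar with the Best-of-N mining step the abstract refers to, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical `sample` callable that draws one solution from the policy and a hypothetical `is_correct` verifier that checks a final answer; both names, and the rule of taking the first correct and first incorrect candidate, are illustrative assumptions. The helper `hit_probability` uses the standard identity for independent samples, $1-(1-p)^N$, which is also an assumption about how samples are drawn.

```python
from typing import Callable, Optional, Tuple


def mine_preference_pair(
    prompt: str,
    sample: Callable[[str], str],            # hypothetical: draws one solution from the policy
    is_correct: Callable[[str, str], bool],  # hypothetical: verifies a solution for the prompt
    n: int = 8,
) -> Optional[Tuple[str, str]]:
    """Draw n candidates and return a (chosen, rejected) pair, or None."""
    candidates = [sample(prompt) for _ in range(n)]
    correct = [c for c in candidates if is_correct(prompt, c)]
    wrong = [c for c in candidates if not is_correct(prompt, c)]
    if not correct or not wrong:
        # All candidates agree in correctness, so there is no contrast to learn from.
        return None
    # Illustrative selection rule: first correct vs. first incorrect candidate.
    return correct[0], wrong[0]


def hit_probability(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct,
    given a per-sample success rate p."""
    return 1.0 - (1.0 - p) ** n
```

Under these assumptions, a prompt yields a DPO training pair only when the N draws contain at least one correct and one incorrect solution, and each draw is one additional generation of cost.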