Abstract
arXiv:2603.21558v2
Self-improvement training, in which models learn from their own generated solutions, promises sustained capability gains but suffers from a pervasive failure mode: over multiple rounds, reasoning errors compound and accuracy stalls or degrades. We trace this drift to standard filtering criteria that retain solutions based solely on final-answer correctness, which lets lucky guesses (correct answers reached through flawed reasoning) contaminate the training data. We propose Verified Self-Improvement (VSI), a framework that conditions data retention on step-level structural integrity rather than on the final output alone. VSI validates each solution by recomputing its arithmetic steps with a computer-algebra library (sympy), checking that intermediate results are consistent, and enforcing domain constraints.
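To make the difference between answer-only filtering and step-level filtering concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each reasoning step is a string of the form `expression = value`, and it swaps in Python's standard `ast` and `fractions` modules for sympy so the example has no dependencies. The names `step_holds` and `keep_solution` are hypothetical, and the consistency and domain-constraint checks mentioned in the abstract are not modeled.

```python
import ast
import operator
from fractions import Fraction

# Arithmetic operators the sketch is willing to recompute.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Evaluate a restricted arithmetic AST exactly, using Fractions."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        # Go through str() so that 0.25 becomes exactly 1/4.
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported expression")


def step_holds(step):
    """Return True if a step written as 'expr = value' recomputes correctly."""
    if step.count("=") != 1:
        return False
    lhs, rhs = step.split("=")
    try:
        left = _eval(ast.parse(lhs.strip(), mode="eval"))
        right = _eval(ast.parse(rhs.strip(), mode="eval"))
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    return left == right


def keep_solution(steps, final_answer, reference_answer):
    """Retain a solution only if the answer matches AND every step verifies."""
    return final_answer == reference_answer and all(step_holds(s) for s in steps)


# A lucky guess: the final answer matches, but both steps are wrong.
lucky = ["2 + 3 = 6", "6 * 2 = 10"]
# A sound solution reaching the same answer.
sound = ["2 + 3 = 5", "5 * 2 = 10"]

answer_only_keeps_lucky = "10" == "10"                # answer-only filter
verified_keeps_lucky = keep_solution(lucky, "10", "10")
verified_keeps_sound = keep_solution(sound, "10", "10")
```

Under these assumptions, an answer-only filter retains the lucky guess, while the step-level filter rejects it and still retains the sound solution. Using exact rational arithmetic avoids rejecting correct steps because of floating-point rounding.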