Reliable Self-Improvement Training by Verifying Reasoning, Not Just Answers

Event: PRODUCT_LAUNCH | 2026-06-01 | Impact: MEDIUM

arXiv:2603.21558v2, announce type: replace

Abstract: Self-improvement training, in which models learn from self-generated solutions, promises sustained capability gains but suffers from a pervasive failure mode: across multiple rounds, compounding reasoning errors cause accuracy to stall or degrade. We trace this drift to standard filtering criteria that retain solutions based solely on final-answer correctness, which lets
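
A minimal sketch of the filtering issue the abstract describes, assuming a toy arithmetic task. The names `Solution`, `answer_only_filter`, `reasoning_aware_filter`, and `arithmetic_step_ok` are hypothetical, and the step checker is a stand-in chosen for illustration; the source text is cut off before it describes the paper's actual verification method, so this is not that method.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Solution:
    steps: List[str]  # intermediate reasoning steps, e.g. "2 + 3 = 5"
    answer: str       # final answer extracted from the solution

def answer_only_filter(solutions: List[Solution], reference: str) -> List[Solution]:
    """Standard criterion: keep a solution if its final answer matches."""
    return [s for s in solutions if s.answer == reference]

def reasoning_aware_filter(
    solutions: List[Solution],
    reference: str,
    step_ok: Callable[[str], bool],
) -> List[Solution]:
    """Hypothetical stricter criterion: the answer must match AND every step must pass a check."""
    return [
        s for s in solutions
        if s.answer == reference and all(step_ok(step) for step in s.steps)
    ]

# Toy step checker (an assumption for this sketch): verifies "<a> <op> <b> = <c>".
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def arithmetic_step_ok(step: str) -> bool:
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    return OPS[op](int(a), int(b)) == int(rhs)

# Task: compute (2 + 3) * 4, reference answer "20".
sound = Solution(steps=["2 + 3 = 5", "5 * 4 = 20"], answer="20")
lucky = Solution(steps=["2 + 3 = 4", "4 * 5 = 20"], answer="20")  # wrong step, right answer
wrong = Solution(steps=["2 + 3 = 5", "5 * 4 = 25"], answer="25")
candidates = [sound, lucky, wrong]

kept_answer_only = answer_only_filter(candidates, "20")
kept_verified = reasoning_aware_filter(candidates, "20", arithmetic_step_ok)
```

Here the answer-only filter retains both `sound` and `lucky`, so a solution with an incorrect intermediate step would enter the next round's training data; the stricter filter retains only `sound`. In a real setting the step check would be far harder than arithmetic validation, which is why it is passed in as a parameter.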