Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

ArXiv cs.CL | 2026-05-28 | Authors: Kunhao Zheng, Pierre Chambon, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve

Abstract

arXiv:2605.28751v1 (announce type: cross)

Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged.
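To make the distinction between interpolation and extrapolation concrete, here is a minimal sketch of combining two checkpoints in weight space. It assumes the common two-checkpoint form θ = θ_a + α(θ_b − θ_a); the abstract does not state the exact formula the authors use, so this is an illustration, not their method. The function name `combine_checkpoints`, the `Weights` type, and the toy values for `low_cov` and `high_cov` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical flat checkpoint format: parameter name -> list of floats.
Weights = Dict[str, List[float]]


def combine_checkpoints(theta_a: Weights, theta_b: Weights, alpha: float) -> Weights:
    """Return theta_a + alpha * (theta_b - theta_a), parameter by parameter.

    0 <= alpha <= 1 interpolates between the two checkpoints;
    alpha < 0 or alpha > 1 extrapolates beyond them.
    """
    if theta_a.keys() != theta_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    combined: Weights = {}
    for name in theta_a:
        a, b = theta_a[name], theta_b[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        combined[name] = [x + alpha * (y - x) for x, y in zip(a, b)]
    return combined


# Toy stand-ins for checkpoints trained with low- and high-coverage rewards.
low_cov = {"w": [1.0, 2.0], "b": [0.0]}
high_cov = {"w": [3.0, 2.0], "b": [1.0]}

midpoint = combine_checkpoints(low_cov, high_cov, 0.5)  # interpolation
beyond = combine_checkpoints(low_cov, high_cov, 1.5)    # extrapolation past high_cov
```

With α = 0.5 the result lies halfway between the two checkpoints, while α = 1.5 continues along the same direction past the second one. No gradient step is involved, which is what "without additional RL training" refers to.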
