When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

ArXiv CS.CL | 2026-05-26 | NEWS | en | Authors: Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen


Details

Source: ArXiv CS.CL
Authors: Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen
Article type: NEWS
Language: en
Published: 2026-05-26

Abstract

arXiv:2605.25629v1 (announce type: new)

Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. We therefore study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weakly supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation.
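The abstract does not give the exact form of the Anchor regularizer, so the following is only a minimal sketch of the general idea under stated assumptions: a Bradley-Terry pairwise preference loss on the student's reward scores, plus a penalty on the squared distance between the student's current features and the features of a frozen copy of the pretrained strong model. The function names, the mean-squared-distance penalty, and the weight `lam` are all hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def anchor_penalty(hidden, hidden_ref):
    """Mean squared distance between current and frozen reference features.

    ASSUMPTION: the paper's regularizer may use a different distance,
    layer selection, or normalization; this is only one plausible form.
    """
    return sum((h - h0) ** 2 for h, h0 in zip(hidden, hidden_ref)) / len(hidden)


def anchored_loss(r_chosen, r_rejected, hidden, hidden_ref, lam=0.1):
    """Preference loss plus a drift penalty weighted by lam (hypothetical)."""
    return bradley_terry_loss(r_chosen, r_rejected) + lam * anchor_penalty(
        hidden, hidden_ref
    )


# Toy example: equal rewards, features drifted away from the reference.
loss = anchored_loss(0.0, 0.0, hidden=[1.0, 2.0], hidden_ref=[0.0, 0.0], lam=0.1)
```

In this sketch, setting `lam` to zero recovers plain preference fine-tuning on the weak labels, while larger values keep the student's features closer to the pretrained model at the cost of less adaptation to the source dataset.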
