When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

arXiv cs.CL · 2026-05-26 · Authors: Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen
