Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output (Event)

PRODUCT_LAUNCH · 2026-06-10 · Impact: MEDIUM

arXiv:2606.10528v1 (Announce Type: cross)

Abstract: Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference differences, whereas RM hidden states encode richer semantic and preference information. We introduce the representation-aware a…
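The abstract is cut off before it describes the proposed method, so the sketch below does not attempt to reproduce it. It only illustrates the scalar-reward baseline the abstract contrasts against, under two assumptions not taken from the source: the RM produces its reward through a linear head on a final hidden state, and advantages come from GRPO-style group normalization of those scalars. All function names and numbers are hypothetical. Two responses with clearly different hidden states land on the same scalar and therefore the same advantage, which is one concrete way a scalar can hide preference-relevant differences.

```python
import math
from typing import List, Sequence


def scalar_reward(hidden: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Hypothetical linear reward head: projects a final hidden state to one scalar."""
    if len(hidden) != len(weights):
        raise ValueError("hidden state and head weights must have the same dimension")
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Assumed GRPO-style baseline: advantage = (r - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up 3-dimensional hidden states for three sampled responses to one prompt.
head_weights = [1.0, 0.0, 0.0]  # this toy head only reads the first coordinate
hiddens = [
    [0.5, 2.0, 1.0],
    [0.5, -3.0, 0.2],  # differs from the first hidden state, yet maps to the same scalar
    [-1.0, 0.0, 0.0],
]

rewards = [scalar_reward(h, head_weights) for h in hiddens]
advantages = group_normalized_advantages(rewards)
print(rewards)  # [0.5, 0.5, -1.0]
print(advantages)  # first two entries are identical
```

Because the advantage is computed from the scalars alone, everything the head projects away (here, the second and third coordinates) never reaches the policy update; a representation-aware estimator would, presumably, draw on that discarded information in some way the truncated abstract does not specify.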