Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization Event

Name: Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
Start: 2026-06-02

PRODUCT_LAUNCH · 2026-06-02 · Impact: MEDIUM

arXiv:2510.05342v2 · Announce Type: replace-cross

Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods have emerged to counter …
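For reference, standard DPO scores each preference pair with a single temperature β applied to the difference of policy-to-reference log-probability ratios: L = -log σ(β · [(log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x))]), where y_w is the chosen and y_l the rejected response. The minimal Python sketch below computes that loss from precomputed sequence log-probabilities (an assumption; real training code works on batched tensors). The names `dpo_loss` and `log_sigmoid` are hypothetical, and the option to pass one β per pair only illustrates what per-example control could look like. It is not the margin-adaptive rule from the paper, which the excerpt above does not describe.

```python
import math
from typing import Sequence, Union


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    beta: Union[float, Sequence[float]],
) -> float:
    """Mean DPO loss over preference pairs.

    beta: one float (standard DPO: the same fixed temperature for every pair)
          or one value per pair (hypothetical per-pair control, for illustration only).
    """
    n = len(policy_chosen)
    betas = [float(beta)] * n if isinstance(beta, (int, float)) else list(beta)
    total = 0.0
    for i in range(n):
        # Log-ratio of policy to reference for the chosen and rejected responses.
        chosen_ratio = policy_chosen[i] - ref_chosen[i]
        rejected_ratio = policy_rejected[i] - ref_rejected[i]
        # The temperature scales the implicit reward margin before the sigmoid.
        margin = betas[i] * (chosen_ratio - rejected_ratio)
        total += -log_sigmoid(margin)
    return total / n


if __name__ == "__main__":
    pc, pr = [-10.0, -12.0], [-14.0, -11.0]
    rc, rr = [-11.0, -12.5], [-13.0, -11.5]
    print(dpo_loss(pc, pr, rc, rr, beta=0.1))         # one fixed beta for all pairs
    print(dpo_loss(pc, pr, rc, rr, beta=[0.05, 0.2])) # hypothetical per-pair betas
```

With a single β, every pair's margin is scaled identically, which is the behavior the abstract identifies as a source of overfitting on easy pairs and under-learning from informative ones.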

Artificial Intelligence
