Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs 文章

ArXiv CS.CV2026-05-26NEWSen作者: Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang

查看原文 →

关系图谱

摘要

arXiv:2506.10054v4 Announce Type: replace-cross Abstract: Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based methods typically treat all preference pairs equally, overlooking substantial variations in data quality and learning difficulty, which leads to inefficient data utilization and suboptimal performance. To address this limitation, we propose Uni-DPO, a unified dynamic preference optimization framework that jointly considers (a) the inherent quality of preference pairs and (b) the model's evolving performance during training. By adaptively reweighting samples based on both factors, Uni-DPO enables more effective use of preference data and achieves superior performance. Extensive experiments across models and benchmarks demonstrate the effectiveness and generalization of Uni-DPO.

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs 文章

摘要

相关事件查看全部 (1)

相关公司查看全部 (5)

相关人物

相关产品查看全部 (11)

相关技术查看全部 (23)