AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

arXiv cs.CL | 2026-05-28 | Authors: Shaolong Chen, Madalina Ciobanu, Qingqing Mao, Ritankar Das

Abstract

arXiv:2605.28440v1 (announce type: new). DPO has become a widely adopted alternative to RLHF for aligning LLMs with human preferences, eliminating the need for a separate reward model or RL loop. Recent theoretical analysis uncovers an asymmetric gradient behavior in DPO: the loss suppresses dispreferred responses substantially faster than it promotes preferred ones, causing the model to learn to avoid bad answers rather than to generate good ones. We propose AdaDPO, a self-adaptive variant of the DPO algorithm that introduces per-preference-pair, stop-gradient-based coefficients derived directly from the policy model's generation probabilities, with the reference model's probabilities as an optional component. AdaDPO is constructed to enforce equality of gradient magnitudes between preferred and dispreferred probabilities;
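
The asymmetry the abstract describes can be seen in the standard DPO loss for a single preference pair. The sketch below is only an illustration of that baseline behavior, not of AdaDPO itself: the excerpt does not give the form of AdaDPO's coefficients, so none are shown. The names `p_w`, `p_l`, `ref_w`, `ref_l` (policy and reference sequence probabilities for the preferred and dispreferred responses) and `beta` are assumptions chosen for this example, and the gradients are taken with respect to the probabilities themselves.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_margin(p_w, p_l, ref_w, ref_l, beta=0.1):
    """beta-scaled log-ratio margin used inside the standard DPO loss."""
    return beta * ((math.log(p_w) - math.log(ref_w))
                   - (math.log(p_l) - math.log(ref_l)))


def dpo_loss(p_w, p_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(margin)."""
    return -math.log(sigmoid(dpo_margin(p_w, p_l, ref_w, ref_l, beta)))


def dpo_prob_grads(p_w, p_l, ref_w, ref_l, beta=0.1):
    """Analytic gradients of the loss w.r.t. p_w and p_l.

    dL/dp_w = -beta * sigmoid(-margin) / p_w
    dL/dp_l = +beta * sigmoid(-margin) / p_l
    """
    scale = beta * sigmoid(-dpo_margin(p_w, p_l, ref_w, ref_l, beta))
    return -scale / p_w, scale / p_l


# Illustrative numbers (assumed): the policy starts equal to the reference.
p_w, p_l = 0.2, 0.05
g_w, g_l = dpo_prob_grads(p_w, p_l, ref_w=p_w, ref_l=p_l)
ratio = abs(g_l) / abs(g_w)  # equals p_w / p_l
```

With these assumed values the gradient magnitude on the dispreferred probability is about four times that on the preferred one, because the ratio of magnitudes equals `p_w / p_l`. As the dispreferred probability shrinks during training, this ratio grows, which is consistent with the abstract's statement that the loss suppresses dispreferred responses faster than it promotes preferred ones.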
