AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

arXiv cs.CL | 2026-05-28 | Authors: Shaolong Chen, Madalina Ciobanu, Qingqing Mao, Ritankar Das

Abstract

arXiv:2605.28440v1. DPO has become a widely adopted alternative to RLHF for aligning LLMs with human preferences, eliminating the need for a separate reward model or RL loop. Recent theoretical analysis uncovers an asymmetric gradient behavior in DPO: the loss suppresses dispreferred responses substantially faster than it promotes preferred ones, causing the model to learn to avoid bad answers rather than to generate good ones. We propose AdaDPO, a Self-Adaptive variant of the DPO algorithm that introduces per-preference-pair, stop-gradient-based coefficients derived directly from the policy model's generation probabilities, with the reference model's probabilities as an optional component. AdaDPO is constructed to enforce equality of gradient magnitudes between preferred and dispreferred probabilities;
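The gradient asymmetry described in the abstract can be illustrated for the standard DPO loss with a minimal Python sketch. It assumes a single preference pair and treats the sequence probabilities as free scalar variables. The names `dpo_loss` and `dpo_grads`, the value `beta=0.1`, and the example probabilities are hypothetical and not taken from the paper. The sketch does not implement AdaDPO's coefficients, which this excerpt does not specify.

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair, as a function of
    sequence probabilities (pi_*: policy, ref_*: reference model)."""
    u = beta * ((math.log(pi_w) - math.log(ref_w))
                - (math.log(pi_l) - math.log(ref_l)))
    # -log(sigmoid(u)) written as log(1 + exp(-u))
    return math.log1p(math.exp(-u))

def dpo_grads(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Analytic partial derivatives of the loss with respect to the
    preferred (pi_w) and dispreferred (pi_l) probabilities."""
    u = beta * ((math.log(pi_w) - math.log(ref_w))
                - (math.log(pi_l) - math.log(ref_l)))
    s = 1.0 / (1.0 + math.exp(u))   # sigmoid(-u) = 1 - sigmoid(u)
    grad_w = -beta * s / pi_w       # negative: the loss falls as pi_w rises
    grad_l = beta * s / pi_l        # positive: the loss falls as pi_l drops
    return grad_w, grad_l

# Illustrative values only (assumed, not from the paper).
grad_w, grad_l = dpo_grads(pi_w=0.4, pi_l=0.1, ref_w=0.3, ref_l=0.2)
ratio = abs(grad_l) / abs(grad_w)
print(f"|dL/dpi_l| / |dL/dpi_w| = {ratio:.3f}")
```

Under these assumptions the ratio of gradient magnitudes is `pi_w / pi_l`, so the less likely the dispreferred response already is, the harder the loss pushes it down relative to how hard it pulls the preferred response up. This is the imbalance that the abstract says AdaDPO is built to remove.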
