Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

ArXiv CS.CV | 2026-05-27 | Authors: Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue

Abstract

arXiv:2605.26491v1 (announce type: cross)

Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the…

The abstract may be incomplete; see the original article for the full text.
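
The steps named in the abstract (reward scores for a group of candidates, centered advantage weights, an advantage-weighted term on the implicit reward, and a quadratic penalty) can be illustrated with a small numerical sketch. This is not the authors' implementation. The function names, the plain mean-centering without any further normalization, the use of one scalar denoising loss per image, and the `penalty` coefficient are all assumptions made for illustration. In particular, the abstract is cut off before it says what the quadratic penalty regularizes, so applying it to the implicit reward here is a guess and should be checked against the paper.

```python
import numpy as np


def centered_advantages(rewards):
    """Center the reward scores of one prompt's candidate group.

    Assumption: plain mean subtraction; the abstract only says "centered".
    """
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()


def implicit_rewards(ref_losses, model_losses):
    """Denoising-loss improvement of the current model over the reference.

    Positive when the current model denoises a candidate better
    (lower loss) than the fixed reference model.
    """
    ref_losses = np.asarray(ref_losses, dtype=float)
    model_losses = np.asarray(model_losses, dtype=float)
    return ref_losses - model_losses


def listwise_loss(rewards, ref_losses, model_losses, penalty=1.0):
    """Hypothetical advantage-weighted objective for one prompt.

    Assumption: the quadratic penalty is applied to the implicit reward;
    the truncated abstract does not state its actual target.
    """
    adv = centered_advantages(rewards)
    imp = implicit_rewards(ref_losses, model_losses)
    weighted_term = -(adv * imp).mean()            # reward high-advantage improvement
    quadratic_term = 0.5 * penalty * (imp ** 2).mean()
    return weighted_term + quadratic_term


# Three candidate images for one prompt, with made-up numbers.
rewards = [1.0, 2.0, 3.0]
ref_losses = [0.5, 0.5, 0.5]
at_reference = listwise_loss(rewards, ref_losses, ref_losses)
improved = listwise_loss(rewards, ref_losses, [0.6, 0.5, 0.4])
print(at_reference, improved)
```

When the current model matches the reference, every implicit reward is zero and the sketch returns 0. Lowering the denoising loss on the higher-reward candidate while raising it on the lower-reward one makes the weighted term negative, so the second value is smaller than the first.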