Explicit Critic Guidance for Aligning Diffusion Models

ArXiv cs.CV | 2026-05-28 | Authors: Zhengyang Liang, Qihang Zhang, Ceyuan Yang

Abstract

arXiv:2605.27736v1 (announce type: cross)

Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking.
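To make "fine-grained credit along denoising trajectories" concrete, the following is a minimal sketch of how per-step advantages could be derived from a timestep-conditioned critic. It is not taken from the paper: the abstract does not state which advantage estimator is used, so generalized advantage estimation (GAE) with a single terminal reward is assumed here, and the function name `gae_advantages` and its parameters are hypothetical.

```python
from typing import List


def gae_advantages(
    values: List[float],
    final_reward: float,
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Per-step advantages for one denoising trajectory (illustrative only).

    Assumptions (not from the source):
      * values[t] is the critic's value estimate for the noisy latent at
        denoising step t, with t = 0 the first (noisiest) step.
      * The only reward is `final_reward`, received after the last step,
        e.g. a score assigned to the final image.
    """
    num_steps = len(values)
    advantages = [0.0] * num_steps
    running = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(num_steps)):
        is_last = t == num_steps - 1
        reward = final_reward if is_last else 0.0
        next_value = 0.0 if is_last else values[t + 1]
        td_error = reward + gamma * next_value - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Hypothetical critic outputs for a 3-step trajectory.
    print(gae_advantages([0.2, 0.5, 0.9], final_reward=1.0))
```

With `lam=1.0` and `gamma=1.0` this reduces to the Monte Carlo advantage `final_reward - values[t]`, while `lam=0.0` yields the one-step temporal-difference error at each step. In a PPO-style setup these advantages would weight the clipped policy objective at every denoising step rather than only at the end of the trajectory.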
