Diffusion-Augmented Markov Decision Processes for Maximum Entropy Reinforcement Learning 文章

ArXiv CS.AI2026-05-28NEWSen作者: Sebastian Sanokowski, Kaustubh Patil

摘要

arXiv:2512.02019v3 Announce Type: replace-cross Abstract: Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning (ME-RL) to diffusion processes, enabling sampling from the optimal policy trajectory distribution. By minimizing a tractable upper bound on the reverse KL divergence between the diffusion policy and the optimal policy trajectory distributions, we derive a modified surrogate objective and introduce Diffusion-Augmented Markov Decision Processes (DA-MDPs). DA-MDPs allow for seamless integration of diffusion policies into any ME-RL method with minimal modifications. We demonstrate its effectiveness by adapting Proximal Policy Optimization (PPO), Wasserstein Policy Optimization (WPO), and Relative Entropy Pathwise Policy Optimization (REPPO) into their diffusion-based variants: DA-MDP: PPO, DA-MDP: WPO, and DA-MDP: REPPO.

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据