Trust Region Q Adjoint Matching Event

PRODUCT_LAUNCH | 2026-05-27 | Impact: MEDIUM

arXiv:2605.27079v1 | Announce Type: cross

Abstract: Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small