Multimodal Action Diffusion for Robust End-to-End Autonomous Driving

arXiv cs.CV, 2026-06-02. Authors: Jorge Daniel Rodríguez-Vidal, Diego Porres, Gabriel Villalonga Pineda, Antonio M. López Peña

Abstract

arXiv:2606.02105v1. Announce type: new.

End-to-End Autonomous Driving (E2E-AD) systems have largely converged on predicting intermediate trajectory waypoints, delegating final control to hand-crafted controllers with GPS access. Direct control-signal prediction (outputting throttle, steering, and brake commands end to end) remains underexplored, and, critically, the role of action multimodality in such systems is not well understood. We argue that moving beyond deterministic, single-action outputs is not merely a modelling choice, but a key driver of driving performance, representational quality, and training stability. To validate this, we introduce the Action Diffusion Transformer (ADT), an anchor-free diffusion transformer trained with an MSE objective that natively models the multimodal distribution of plausible driving actions.
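
To make the multimodality argument concrete, the following is a minimal sketch and not the authors' implementation. It assumes a standard DDPM-style setup (linear noise schedule, noise-prediction target), which the abstract does not specify; the helper names `alpha_bar`, `add_noise`, and `mse`, the schedule constants, and the two example actions are all hypothetical. It shows (a) that a deterministic MSE regressor facing two equally valid actions is pulled toward their average, and (b) how an MSE objective can instead be applied to the noise added to a single sampled action.

```python
import math
import random

def alpha_bar(t, num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t (assumed linear schedule)."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(action, t, rng):
    """Forward diffusion: x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps."""
    ab = alpha_bar(t)
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    noisy = [math.sqrt(ab) * a + math.sqrt(1.0 - ab) * e
             for a, e in zip(action, noise)]
    return noisy, noise

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

# Hypothetical scene with two equally plausible expert actions,
# written as (throttle, steer, brake): pass an obstacle on the left or right.
left = [0.5, -0.5, 0.0]
right = [0.5, 0.5, 0.0]

# A deterministic regressor trained with MSE on both targets is minimized
# by their mean, i.e. steering straight at the obstacle.
mean_action = [(l + r) / 2 for l, r in zip(left, right)]

# Diffusion-style training instead noises ONE sampled action at a time and
# applies MSE to the noise (assumed target), so neither mode is averaged away.
rng = random.Random(0)
t = 50
noisy, noise = add_noise(left, t, rng)

# Stand-in for a perfect denoiser: predicting the true noise gives zero loss
# and lets us recover the original action exactly (up to float error).
ab = alpha_bar(t)
recovered = [(x - math.sqrt(1.0 - ab) * e) / math.sqrt(ab)
             for x, e in zip(noisy, noise)]
loss = mse(noise, noise)
```

In this toy case `mean_action` has zero steering, which matches neither expert mode, whereas the denoising formulation keeps each sampled action as its own training target.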