UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

ArXiv CS.CV | 2026-06-16 | NEWS | en | Authors: Shuai Wang, Liang Li, Yang Chen, Ruopeng Gao, Yao Teng, Limin Wang

Details

Source site
ArXiv CS.CV
Authors
Shuai Wang, Liang Li, Yang Chen, Ruopeng Gao, Yao Teng, Limin Wang
Article type
NEWS
Language
en
Publication date
2026-06-16

Abstract

arXiv:2606.16255v1. Announce type: new. Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face three prominent challenges: (1) inherent learning conflicts between visual understanding and generation tasks, which lead to suboptimal modeling of both; (2) separate visual spaces for understanding and generation, which impede scalability; (3) over-reliance on task-specific data, which neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which leverages a Noisy ViT encoder along with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding.
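To make the decoupling described in the abstract concrete, here is a toy data-flow sketch. It is not the authors' implementation: the class name `ToyUnifiedModel`, the random linear maps standing in for the Noisy ViT encoder, the LLM, and the two decoders, all dimensions, and the choice to model "noisy" input as additive Gaussian noise are assumptions made only for illustration. The sketch shows one shared encoding path feeding two separate output paths, one for text and one for diffusion decoding.

```python
import random

DIM = 8  # hypothetical shared feature width


def make_linear(n_in, n_out, rng):
    """Random weight matrix standing in for a trained module (illustrative only)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, vec):
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class ToyUnifiedModel:
    """Hypothetical stand-in: shared encoder + LLM, two decoupled decoders."""

    def __init__(self, patch_dim=4, vocab=16, seed=0):
        self.rng = random.Random(seed)
        self.encoder = make_linear(patch_dim, DIM, self.rng)  # stands in for the Noisy ViT encoder
        self.llm = make_linear(DIM, DIM, self.rng)            # stands in for the LLM
        self.text_head = make_linear(DIM, vocab, self.rng)    # text decoding path
        self.diffusion_decoder = make_linear(DIM, patch_dim, self.rng)  # separate diffusion path

    def encode(self, patches, noise_level):
        # Assumption: understanding feeds clean patches (noise_level = 0),
        # generation feeds patches perturbed by Gaussian noise.
        noisy = [[x + noise_level * self.rng.gauss(0.0, 1.0) for x in p] for p in patches]
        features = [apply_linear(self.encoder, p) for p in noisy]
        return [apply_linear(self.llm, f) for f in features]

    def understand(self, patches):
        """Shared encoding, then the text head: one token id per patch."""
        hidden = self.encode(patches, noise_level=0.0)
        logits = [apply_linear(self.text_head, h) for h in hidden]
        return [max(range(len(row)), key=row.__getitem__) for row in logits]

    def denoise_step(self, patches, noise_level):
        """Shared encoding, then the diffusion decoder: one output patch per input patch."""
        hidden = self.encode(patches, noise_level)
        return [apply_linear(self.diffusion_decoder, h) for h in hidden]


model = ToyUnifiedModel()
patches = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
tokens = model.understand(patches)
denoised = model.denoise_step(patches, noise_level=0.5)
```

Both `understand` and `denoise_step` call the same `encode` method, so the semantic encoding is shared, while the text head and the diffusion decoder hold separate weights. That separation is the only property the sketch is meant to illustrate.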