UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Details

Source: ArXiv CS.CV
Authors: Shuai Wang, Liang Li, Yang Chen, Ruopeng Gao, Yao Teng, Limin Wang
Article type: NEWS
Language: en
Published: 2026-06-16

Abstract

arXiv:2606.16255v1 (announce type: new)

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face three prominent challenges: (1) inherent learning conflicts between visual understanding and generation, which lead to suboptimal modeling of both tasks; (2) separate visual spaces for understanding and generation, which impede scalability; and (3) over-reliance on task-specific data, which neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which uses a Noisy ViT encoder together with an LLM to unify semantic encoding for visual generation and understanding, and employs a separate diffusion decoder to decouple diffusion decoding from text decoding.
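The data flow described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: every function name, the noise-blending rule, and the arithmetic inside each stand-in are assumptions made purely for illustration, since the abstract gives no architectural details. The sketch only shows the routing idea: one noisy encoder and one shared backbone serve both tasks, while text decoding and diffusion decoding use separate heads.

```python
import random
from typing import List

Vector = List[float]


def add_noise(patches: List[Vector], t: float, rng: random.Random) -> List[Vector]:
    """Blend patches with Gaussian noise (assumed rule): t=0 keeps the image, t=1 is pure noise."""
    return [[(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in p] for p in patches]


def noisy_vit_encoder(patches: List[Vector], t: float, rng: random.Random) -> List[Vector]:
    """Hypothetical stand-in for the Noisy ViT encoder: noised patches -> semantic tokens."""
    noisy = add_noise(patches, t, rng)
    # Toy "encoding": mean-centre each patch. A real ViT would use attention layers.
    return [[x - sum(p) / len(p) for x in p] for p in noisy]


def llm_backbone(text_tokens: List[Vector], visual_tokens: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for the shared LLM: one hidden state per input token."""
    sequence = text_tokens + visual_tokens
    # Toy "context mixing": add the sequence-wide mean feature to every token.
    context = [sum(col) / len(sequence) for col in zip(*sequence)]
    return [[x + c for x, c in zip(tok, context)] for tok in sequence]


def text_decoder(hidden: List[Vector]) -> List[int]:
    """Text head (toy): the index of the largest feature acts as a fake token id."""
    return [max(range(len(h)), key=h.__getitem__) for h in hidden]


def diffusion_decoder(hidden_visual: List[Vector], t: float) -> List[Vector]:
    """Separate diffusion head (toy): one update vector per visual token."""
    return [[-t * x for x in h] for h in hidden_visual]


def forward(text_tokens: List[Vector], patches: List[Vector], t: float, task: str, seed: int = 0):
    """Route both tasks through the same encoder and backbone, then pick a decoder."""
    rng = random.Random(seed)
    visual = noisy_vit_encoder(patches, t, rng)
    hidden = llm_backbone(text_tokens, visual)
    n_text = len(text_tokens)
    if task == "understand":
        return text_decoder(hidden[:n_text])
    if task == "generate":
        return diffusion_decoder(hidden[n_text:], t)
    raise ValueError(f"unknown task: {task!r}")


# Usage: text tokens and patches share one feature width (4 here, chosen arbitrarily).
text = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
image = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
answer_ids = forward(text, image, t=0.0, task="understand")  # clean image, text head
update = forward(text, image, t=0.5, task="generate")        # noised image, diffusion head
```

In this sketch, understanding is treated as the zero-noise case of the same encoder, which is one plausible reading of "Noisy ViT encoder" rather than something the abstract states.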
