TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation

arXiv cs.CV | 2026-06-08 | Authors: Dian Gu, Zhengyi Yang

Abstract

arXiv:2606.07053v1 (announce type: new). Pose-guided text-to-image generation often suffers from limb distortions and feature crosstalk in complex multi-person scenarios. While existing UNet-based adapters struggle with long-range spatial dependencies, emerging Multimodal Diffusion Transformers (MM-DiTs) offer superior global modeling. However, naive signal concatenation in MM-DiTs severely disrupts pre-trained latent distributions. To address this, we propose TrioPose, a native pose-driven framework built upon the SD3.5M architecture. Specifically, we introduce a Triple-Stream Pose-Aware DiT (TSPA-DiT) that treats pose as an independent modality. It employs layer-wise activation and zero-initialized dual-residual injection to smoothly enforce geometric constraints while preserving pre-trained latent stability.
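
The zero-initialized injection mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The class names, the mean-pooling of pose tokens, the reading of "dual-residual" as one residual path into the image stream and one into the text stream, and the `pose_active` flag standing in for layer-wise activation are all assumptions made for illustration. The sketch demonstrates only one property: a zero-initialized projection leaves the host streams unchanged at initialization, so pre-trained behavior is preserved until training moves the weights away from zero.

```python
import numpy as np


class ZeroInitInjection:
    """Hypothetical helper: adds a projected pose feature to a host stream.

    The projection starts at zero, so the residual is an exact no-op
    before any training step has updated the weights.
    """

    def __init__(self, pose_dim, host_dim):
        self.weight = np.zeros((pose_dim, host_dim))
        self.bias = np.zeros(host_dim)

    def __call__(self, host_tokens, pose_summary):
        # pose_summary has shape (pose_dim,); the projected vector is
        # broadcast over every token of the host stream.
        return host_tokens + (pose_summary @ self.weight + self.bias)


class TripleStreamBlockSketch:
    """Hypothetical block with image, text, and pose streams.

    Assumption: "dual-residual" is read here as two residual paths,
    one into the image stream and one into the text stream.
    Assumption: "layer-wise activation" is modeled as a per-block flag.
    """

    def __init__(self, dim, pose_active=True):
        self.pose_active = pose_active
        self.to_image = ZeroInitInjection(dim, dim)
        self.to_text = ZeroInitInjection(dim, dim)

    def __call__(self, image_tokens, text_tokens, pose_tokens):
        if not self.pose_active:
            # Blocks without pose activation pass both streams through.
            return image_tokens, text_tokens
        # Simplification: summarize the pose stream by mean pooling.
        pose_summary = pose_tokens.mean(axis=0)
        return (
            self.to_image(image_tokens, pose_summary),
            self.to_text(text_tokens, pose_summary),
        )
```

With freshly constructed blocks, calling `TripleStreamBlockSketch(dim=8)(image, text, pose)` returns the image and text tokens unchanged, which is the stability property that zero initialization is meant to provide. A real MM-DiT block would also run joint attention and feed-forward layers, which are omitted here.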
