SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs 文章

ArXiv CS.CV2026-05-26NEWSen作者: Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, Gennady Pekhimenko

摘要

arXiv:2601.20273v2 Announce Type: replace-cross Abstract: Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine.

SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs 文章

摘要

相关事件查看全部 (1)

相关公司查看全部 (2)

相关人物

相关产品查看全部 (7)

相关技术查看全部 (24)