HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

ArXiv cs.CV, 2026-06-04. Authors: Yu He, Lichen Ma, Zipeng Guo, Xinyuan Shan, Jingling Fu, Dong Chen, Junshi Huang, Yan Li

Abstract

arXiv:2605.15741v2 (announce type: replace)

Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this, we propose HyperDiT, a unified framework that establishes Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifolds. Rather than injecting semantics through AdaLN, HyperDiT uses cross-attention, enabling fine-grained tokens to query multi-level semantic anchors globally. To resolve the spatial mismatch that arises during multi-scale interactions, we introduce Scale-Aware Rotary Position Embedding (SA-RoPE), which ensures precise geometric alignment among tokens of varying patch sizes.
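The abstract does not say how SA-RoPE assigns positions, so the following is only an illustrative sketch under a stated assumption: tokens at every patch size are given the pixel-space coordinate of their patch center, and those coordinates are fed into ordinary rotary embeddings. The helper names (`patch_centers`, `rope_angles`, `rotate`) are hypothetical and are not taken from the paper. The sketch shows why a shared coordinate frame helps: a coarse token lands at the centroid of the fine tokens it covers, and the rotated dot product depends only on relative offset.

```python
import math


def patch_centers(image_size: int, patch_size: int) -> list[float]:
    """Pixel-space center of each patch along one axis (assumed positioning scheme)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return [(i + 0.5) * patch_size for i in range(image_size // patch_size)]


def rope_angles(position: float, dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE rotation angles for a scalar position; dim must be even."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    return [position / (base ** (2 * k / dim)) for k in range(dim // 2)]


def rotate(vec: list[float], angles: list[float]) -> list[float]:
    """Rotate consecutive pairs (vec[2k], vec[2k+1]) by angles[k]."""
    out: list[float] = []
    for k, theta in enumerate(angles):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Fine tokens (patch size 4) and coarse tokens (patch size 16) on a 32-pixel axis.
fine = patch_centers(32, 4)
coarse = patch_centers(32, 16)

# A fine-grained query attends to a coarse semantic anchor in the same frame.
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]
score = dot(rotate(q, rope_angles(fine[1], 4)), rotate(k, rope_angles(coarse[0], 4)))
```

Because both scales share pixel units here, the attention score between a fine query and a coarse key depends only on their pixel-space offset, which is one way the "geometric alignment" described above could be realized. The actual SA-RoPE formulation may differ.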