COLLAR: Cascaded Object-Level Latent Refinement for High-Fidelity Conditional Generation

arXiv cs.CV, 2026-06-02
Authors: Xinlong Zhang, Jia Wei, Xiaoyu Zhang, Teng Zhou, Chengyu Lin, Yongchuan Tang

Abstract

arXiv:2606.00954v1 (announce type: new)

Achieving high-fidelity object-level control in Diffusion Transformers remains a significant challenge despite the introduction of structural priors such as depth and Canny edge maps. Current object-level conditional generation methods frequently suffer from visual artifacts and struggle to maintain precise control over objects within small localized regions. To address these limitations, we propose Cascaded Object-Level Latent Refinement (COLLAR), a training-free framework that progressively optimizes object-level features via Field-of-View (FoV) expansion. First, we introduce the Cross-Scale Semantic Alignment (CSSA) module, which addresses spatial-semantic gaps by injecting object-level features into extended-FoV branches via attention mechanisms. To further optimize these features, the Cyclic Feature Injection (CFI) module introduces a reciprocal background feedback mechanism.