FontFusion: Enhancing Generative Text in Diffusion Models with Typographic Conditioning

arXiv cs.CV | 2026-06-05 | Authors: Marian Lupascu, Nipun Jindal, Ionut Mironica, Zhaowen Wang

Abstract

arXiv:2606.06066v1 (announce type: new)

Typography generation in diffusion models faces a persistent trade-off: enabling precise font control typically degrades text legibility, while maintaining readability often sacrifices typographic fidelity. We present FontFusion, a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures that resolves this dilemma through three core innovations: (1) a hierarchical token representation establishing explicit text-font relationships at multiple granularities, (2) position-aware embeddings creating spatial bindings between typography and image content, and (3) a multi-level token dropping strategy improving both computational efficiency and generalization to unseen fonts. Our systematic evaluation of font embedding spaces reveals that a dual encoder combining DeepFont and DINOv2 outperforms any single encoder for typography tasks.
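The abstract does not specify how the dual-encoder embedding is formed or how tokens are dropped, so the following is only an illustrative sketch of one plausible reading, not the authors' method. It assumes that the two font embeddings are fused by concatenating their L2-normalized vectors, and that "multi-level" dropping means dropping the whole condition, a whole granularity level, or individual tokens. The function names (`fuse_font_embeddings`, `drop_tokens`), the level names, and all probabilities are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_font_embeddings(emb_a: Sequence[float], emb_b: Sequence[float]) -> Vector:
    """Assumed fusion rule: concatenate two L2-normalized font embeddings.

    emb_a and emb_b stand in for the outputs of two font encoders
    (e.g. a DeepFont-style and a DINOv2-style encoder); the paper's
    actual fusion may differ.
    """
    return l2_normalize(emb_a) + l2_normalize(emb_b)


def drop_tokens(
    levels: Dict[str, List[str]],
    p_all: float,
    p_level: float,
    p_token: float,
    rng: random.Random,
) -> Dict[str, List[str]]:
    """Hypothetical multi-level dropping over hierarchical condition tokens.

    levels maps a granularity name (e.g. "line", "word", "char") to its tokens.
    With probability p_all every level is emptied; otherwise each level is
    emptied with probability p_level, and each surviving token is removed
    with probability p_token.
    """
    if rng.random() < p_all:
        return {name: [] for name in levels}
    kept: Dict[str, List[str]] = {}
    for name, tokens in levels.items():
        if rng.random() < p_level:
            kept[name] = []
        else:
            kept[name] = [t for t in tokens if rng.random() >= p_token]
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    cond = {"line": ["L0"], "word": ["w0", "w1"], "char": list("Hello")}
    print(fuse_font_embeddings([3.0, 4.0], [0.0, 2.0]))
    print(drop_tokens(cond, p_all=0.1, p_level=0.2, p_token=0.3, rng=rng))
```

Under these assumptions, dropping would be applied only during training, and setting all three probabilities to zero at inference passes the full condition through unchanged. Fewer condition tokens per step is what would reduce compute in such a scheme.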