Imagine Before You Draw: Visual Prompt Engineering for Image Generation 文章

ArXiv CS.CV2026-06-04NEWSen作者: Liyu Jia, Fengda Zhang, Jiachun Pan, Kesen Zhao, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao, Aojun Zhou, Hanwang Zhang

查看原文 →

关系图谱

摘要

arXiv:2606.04457v1 Announce Type: new Abstract: Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance.

Imagine Before You Draw: Visual Prompt Engineering for Image Generation 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品查看全部 (15)

相关技术