Vision-Language Binding in In-Context Image Generation

arXiv cs.CV | 2026-05-26 | Authors: Chris Ge, Rohit Gandikota, Antonio Torralba, Tamar Rott Shaham

Abstract

arXiv:2605.24624v1. In-context image generation models such as FLUX.2 take a text prompt and an optional reference image as visual conditioning for the output. Internally, all three inputs (text, reference image, and noise tokens) are concatenated and processed through a single attention stream, where all tokens can attend to one another. This leaves open how reference information flows through the model to produce the output image. We show that an implicit cross-modal binding emerges between the text tokens and the reference image: the text tokens absorb visual reference content during the forward pass, and that absorbed content causally influences the generated output. We surface this binding with three causal interventions on FLUX.2: T2I Lens, which decodes intermediate text-token activations through a text-to-image path; Attention Knockout, which severs specific attention edges;
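The idea of a single attention stream over concatenated tokens, and of severing specific attention edges, can be illustrated with a toy sketch. This is not FLUX.2's architecture or the authors' implementation: it assumes one attention head with no learned projections, and the names `joint_attention`, `text_outputs`, and the group labels `"text"`, `"ref"`, `"noise"` are hypothetical. It also assumes every query keeps at least one unblocked key.

```python
import math

def softmax(scores):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(tokens, groups, knockout=()):
    """Toy single-head self-attention over one concatenated sequence.

    tokens:   list of d-dimensional vectors, used as query, key and value
              (an assumption; real models use learned projections).
    groups:   one label per token, e.g. "text", "ref", "noise".
    knockout: (query_group, key_group) pairs whose attention edges are cut.
    """
    d = len(tokens[0])
    blocked = set(knockout)
    out = []
    for q, q_group in zip(tokens, groups):
        scores = []
        for k, k_group in zip(tokens, groups):
            if (q_group, k_group) in blocked:
                scores.append(-math.inf)  # severed edge
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

# Hypothetical toy inputs: two text tokens, one reference token, one noise token.
text = [[1.0, 0.0], [0.0, 1.0]]
noise = [[0.5, 0.5]]
groups = ["text", "text", "ref", "noise"]

def text_outputs(ref_vector, knockout=()):
    tokens = text + [ref_vector] + noise
    return joint_attention(tokens, groups, knockout)[:2]

# With full attention, text-token outputs depend on the reference content.
changed = text_outputs([2.0, 0.0]) != text_outputs([0.0, 2.0])

# With text -> ref edges cut, text-token outputs ignore the reference.
cut = [("text", "ref")]
unchanged = text_outputs([2.0, 0.0], cut) == text_outputs([0.0, 2.0], cut)
```

In this one-layer toy, cutting the text-to-reference edges is enough to make the text-token outputs independent of the reference. In a multi-layer model, reference content could still reach text tokens indirectly through other tokens, so a knockout there would need to be chosen per layer and per edge type.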
