The Vision Encoder as a Privacy Boundary: Visual-Token Side Channels in Encoder-Free Vision-Language Models 文章

ArXiv CS.CV2026-06-16NEWSen作者: Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

详细信息

来源站点: ArXiv CS.CV
作者: Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou
文章类型: NEWS
语言: en
发布日期: 2026-06-16

摘要

arXiv:2606.14783v1 Announce Type: new Abstract: A vision encoder compresses image pixels into semantic embeddings, implicitly acting as a privacy boundary by preserving semantic content while attenuating pixel-local detail required for exact text recovery. Encoder-free vision-language models (VLMs) remove this boundary by routing image patches directly into the language-model token stream, thereby exposing an architectural privacy attack surface: intermediate visual tokens become a pre-output side channel. Under a token-access adversary, decoders invert visual-token streams from two encoder-free VLMs, Gemma4 and Fuyu, recovering recognizable image structure and readable held-out access codes, whereas matched encoder-based controls localize target regions but recover no exact strings. Within-model ablations show that the operative factor is spatial sampling fidelity of the visual-token grid, especially character-direction sampling density, rather than token or value count.

The Vision Encoder as a Privacy Boundary: Visual-Token Side Channels in Encoder-Free Vision-Language Models 文章

详细信息

摘要

相关事件

相关公司

相关人物

相关产品查看全部 (2)

相关技术