Positional Encodings Anchor Spatial Structure in Vision Transformers: A Geometric Perspective on Robustness

ArXiv CS.CV, 2026-06-02. Author: Mahmoud Mannes

Abstract

arXiv:2606.00124v1 (announce type: new)

Positional embeddings (PEs) in Vision Transformers (ViTs) are known to impact performance and robustness, but their role in shaping internal spatial representations is not well understood. In this work, we study how different forms of PEs influence the representational geometry of ViTs and how these changes relate to robustness under content-disrupting distribution shifts. We introduce a metric, the Spatial Similarity Distance Correlation (SSDC), to quantify spatial structure in token representations. Using this metric, we show that ViTs trained without PEs still develop non-trivial spatial structure, but this structure is driven by visual content and collapses under token permutation. In contrast, we find that all PEs considered (learned absolute, sinusoidal, and rotary) are associated with a consistent shift toward an index-anchored spatial organization.
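The abstract names SSDC but does not give its formula. As a rough illustration only, the sketch below assumes one plausible reading of the name: the Pearson correlation between pairwise cosine similarity of token representations and the negated pairwise distance between their positions on the patch grid. The function names, the choice of cosine similarity and Euclidean grid distance, and the toy tokens are all assumptions made here, not the paper's definition.

```python
import numpy as np


def grid_distances(h, w):
    """Pairwise Euclidean distances between positions on an h x w patch grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (N, 2)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))  # (N, N)


def cosine_similarity_matrix(tokens):
    """Pairwise cosine similarity between the rows of an (N, D) token matrix."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    normed = tokens / (norms + 1e-12)
    return normed @ normed.T


def similarity_distance_correlation(tokens, h, w):
    """Hypothetical SSDC-style score (an assumption, not the paper's formula).

    Pearson correlation, over unordered token pairs, between cosine
    similarity and negative grid distance. Values near 1 mean nearby
    grid indices hold similar representations.
    """
    if tokens.shape[0] != h * w:
        raise ValueError("expected one token per grid cell")
    sim = cosine_similarity_matrix(tokens)
    dist = grid_distances(h, w)
    upper = np.triu_indices(h * w, k=1)  # each unordered pair once
    return float(np.corrcoef(sim[upper], -dist[upper])[0, 1])


# Toy tokens whose features vary smoothly with grid position.
h, w, freq = 8, 8, 0.2
ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
tokens = np.stack(
    [np.cos(freq * ys), np.sin(freq * ys), np.cos(freq * xs), np.sin(freq * xs)],
    axis=-1,
).reshape(h * w, 4)

ordered_score = similarity_distance_correlation(tokens, h, w)

# Shuffle which token sits at which grid index, as a stand-in for token permutation.
rng = np.random.default_rng(0)
permuted_score = similarity_distance_correlation(tokens[rng.permutation(h * w)], h, w)

print(f"ordered: {ordered_score:.3f}  permuted: {permuted_score:.3f}")
```

In this toy setting the score is high when token content tracks grid position and falls toward zero once tokens are shuffled relative to their indices, which mirrors the collapse under token permutation that the abstract reports for ViTs trained without PEs. How the paper actually computes SSDC, and at which layers, would need to be checked against the full text.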
