Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers

ArXiv CS.CV | 2026-06-16 | NEWS | en | Authors: Leyla Naz Candogan, Arshia Afzal, Pol Puigdemont, Volkan Cevher

Details

Source site
ArXiv CS.CV
Authors
Leyla Naz Candogan, Arshia Afzal, Pol Puigdemont, Volkan Cevher
Article type
NEWS
Language
en
Publication date
2026-06-16

Abstract

arXiv:2606.14757v1 (announce type: new). Although Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, their attention mechanism is permutation equivariant and therefore lacks explicit spatial inductive biases. This becomes particularly important in two settings: when model capacity is small or when training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curve-specific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance.
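The masking idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not say which curves are used, how the decay is parameterized, how masks are combined, or where the multiplication happens. The sketch assumes two snake (boustrophedon) scans as stand-in curves, an exponential decay `gamma ** distance` along each curve with a fixed illustrative `gamma`, averaging as the combination rule, and multiplication after the softmax followed by row renormalization. All function names are hypothetical.

```python
import numpy as np

def snake_order(h, w, transpose=False):
    """Position of each patch along a snake scan of an h x w patch grid.

    Stand-in for a space filling curve (assumption, not the paper's curves).
    Patches are identified by their row-major index.
    """
    idx = np.arange(h * w).reshape(h, w)
    grid = idx.T.copy() if transpose else idx.copy()
    # Reverse every other scan line so consecutive steps stay adjacent.
    grid[1::2] = grid[1::2, ::-1].copy()
    order = grid.ravel()                 # patch ids in visiting order
    pos = np.empty(h * w, dtype=int)
    pos[order] = np.arange(h * w)        # invert: patch id -> curve position
    return pos

def decay_mask(pos, gamma):
    """Mask entry (i, j) decays with the distance between i and j on the curve."""
    dist = np.abs(pos[:, None] - pos[None, :])
    return gamma ** dist

def combined_mask(h, w, gamma=0.9):
    """Average the curve-specific masks (averaging is an assumption)."""
    masks = [
        decay_mask(snake_order(h, w, transpose=False), gamma),
        decay_mask(snake_order(h, w, transpose=True), gamma),
    ]
    return np.mean(masks, axis=0)

def masked_attention(q, k, v, mask):
    """Single-head softmax attention with an elementwise spatial mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    weights = weights * mask             # multiply mask with attention matrix
    weights /= weights.sum(axis=-1, keepdims=True)  # renormalize (assumption)
    return weights @ v

if __name__ == "__main__":
    h, w, d = 4, 4, 8
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(h * w, d)) for _ in range(3))
    out = masked_attention(q, k, v, combined_mask(h, w))
    print(out.shape)
```

Because each mask depends only on the grid size and the decay value, it can be computed once and reused across layers and batches, which is consistent with the small overhead the abstract reports, although the abstract itself does not describe how the masks are cached or learned.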
