详细信息
- 来源站点
- ArXiv CS.CV
- 作者
- Leyla Naz Candogan, Arshia Afzal, Pol Puigdemont, Volkan Cevher
- 文章类型
- NEWS
- 语言
- en
- 发布日期
- 2026-06-16
摘要
arXiv:2606.14757v1 Announce Type: new Abstract: Though Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, due to permutation equivariance, their attention mechanism lacks explicit spatial inductive biases. This become particularly important in two settings: when model capacity is small or training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curve-specific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance.
相关事件
暂无数据
相关公司
暂无数据
相关人物
暂无数据