Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders 文章
摘要
arXiv:2606.00746v1 Announce Type: new Abstract: Vision foundation models are bottlenecked by the quadratic cost of self-attention, which limits usable resolution and increases the cost of large-scale pretraining. Subquadratic alternatives such as linear attention and state-space models reduce this cost, but often serialize images into 1D token streams and weaken the 2D spatial structure important for vision. Generalized Spatial Propagation Networks (GSPN) instead propagate context directly on the 2D grid through line-scan recurrences, achieving near-linear complexity without positional embeddings, but have seen little use as foundation-scale encoders. We present C-GSPN, a foundation-scale vision encoder based on 2D spatial propagation.
相关事件查看全部 (1)
相关公司
暂无数据
相关人物
暂无数据