On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning

ArXiv CS.CV | 2026-05-28 | Authors: Siyi Lyu, Quan Liu, Feng Yan

Abstract

arXiv:2601.03048v2 (announce type: replace)

Vision Transformers (ViTs) excel at semantic recognition but fail systematically on spatial reasoning tasks such as mental rotation. These failures are often attributed to data scale; this work argues instead that the limitation arises from the intrinsic circuit complexity of the architecture. By formalizing spatial understanding as a Group Homomorphism Problem, in which latent embeddings must preserve the algebraic structure of physical transformations acting on images, we identify a fundamental computational bottleneck. Specifically, for non-solvable groups (e.g., $\mathrm{SO}(3)$), maintaining such structure-preserving embeddings is lower-bounded by the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, constant-depth ViTs with polynomial precision are strictly bounded by the complexity class $\mathsf{TC^0}$.
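To make the Word Problem concrete, the following is a minimal sketch that is not taken from the paper. It assumes the alternating group $A_5$ (the even permutations of five points, a finite non-solvable group that also occurs as the rotation group of the icosahedron inside $\mathrm{SO}(3)$) as a small stand-in for a non-solvable transformation group. The names `compose`, `inverse`, `word_is_identity`, and the sample elements `a` and `b` are hypothetical and chosen only for illustration. The code evaluates a word sequentially, so it illustrates the problem statement only and says nothing about circuit depth.

```python
from functools import reduce
from typing import Sequence, Tuple

# A permutation of {0, ..., n-1}; p[i] is the image of i.
Perm = Tuple[int, ...]


def identity(n: int) -> Perm:
    """Identity permutation on n points."""
    return tuple(range(n))


def compose(p: Perm, q: Perm) -> Perm:
    """Return the permutation that applies p first, then q."""
    return tuple(q[p[i]] for i in range(len(p)))


def inverse(p: Perm) -> Perm:
    """Return the inverse permutation of p."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def word_is_identity(word: Sequence[Perm], n: int = 5) -> bool:
    """Word Problem: does the product of the listed elements equal the identity?"""
    return reduce(compose, word, identity(n)) == identity(n)


# Two even permutations in A5 (illustrative choice, not from the paper):
a: Perm = (1, 2, 0, 3, 4)  # the 3-cycle (0 1 2)
b: Perm = (1, 2, 3, 4, 0)  # the 5-cycle (0 1 2 3 4)

if __name__ == "__main__":
    # A word followed by its reversed inverses collapses to the identity.
    print(word_is_identity([a, b, inverse(b), inverse(a)]))
    # The commutator of a and b is not the identity, so a and b do not commute.
    print(word_is_identity([a, b, inverse(a), inverse(b)]))
```

In this framing, a structure-preserving embedding would have to track such products implicitly: composing the embedded transformations must agree with embedding the composed transformation, so deciding whether a sequence of transformations returns an image to its starting pose is an instance of the check above.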
