The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space 文章

ArXiv CS.CV2026-06-02NEWSen作者: Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas, Chun-Ta Lu, Zhicheng Wang

摘要

arXiv:2605.09883v2 Announce Type: replace Abstract: As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics -- thus fundamentally breaking the orthogonal prior that models exploit.