m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

ArXiv CS.CV | 2026-06-17 | NEWS | en | Authors: Yosub Shin, Michael Buriek, Igor Molybog

Details

Source site
ArXiv CS.CV
Authors
Yosub Shin, Michael Buriek, Igor Molybog
Article type
NEWS
Language
en
Publication date
2026-06-17

Abstract

arXiv:2601.19099v2 (announce type: replace)

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, below human annotators, who reach 72.0% on average (and 95% for an expert) with strong inter-annotator agreement (κ up to 0.76).
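The abstract reports two kinds of numbers: accuracy against a reference answer, and an agreement statistic κ between annotators. As a rough illustration of how such figures are typically computed, here is a minimal sketch. It assumes each item has a single categorical direction label (the labels "A" to "D" below are hypothetical) and that κ is pairwise Cohen's kappa. The abstract does not state the answer format or which kappa variant the authors use, so this is not the benchmark's own scoring code.

```python
from collections import Counter


def accuracy(predictions, answers):
    """Fraction of items where the predicted label equals the reference label."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and treating it as perfect agreement is a convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data, not taken from m2sv.
reference = ["A", "B", "C", "D"]
model_output = ["A", "B", "C", "A"]
annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "A"]

model_accuracy = accuracy(model_output, reference)
pair_kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa discounts agreement that would occur by chance, which is why it can sit well below raw percentage agreement when a few labels dominate.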
