Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

arXiv cs.CV | 2026-06-02 | Author: Rashid Mushkani

Abstract

arXiv:2606.00871v1 (announce type: new)

Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs.
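To make the reporting idea concrete, the following is a minimal sketch of how a human label distribution, a reliability figure, and a model alignment figure could be reported side by side for one scene and one dimension. It is an illustration only, not the paper's scoring policy: the abstract does not name a reliability statistic, so raw pairwise agreement is used here as a simple stand-in (a chance-corrected coefficient such as Krippendorff's alpha would be a more usual choice). The `ABSTAIN` token, the function names, and the toy labels are all hypothetical.

```python
from collections import Counter
from itertools import combinations

ABSTAIN = "abstain"  # hypothetical token for an explicit non-response


def label_distribution(labels):
    """Empirical distribution over labels; abstention is kept as its own outcome."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def pairwise_agreement(labels):
    """Fraction of annotator pairs that gave the same label (abstention included)."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return float("nan")  # undefined with fewer than two annotators
    return sum(a == b for a, b in pairs) / len(pairs)


def model_alignment(model_label, labels):
    """Share of annotators whose label matches the model's output."""
    return label_distribution(labels).get(model_label, 0.0)


# Toy example: four annotators rate one scene on one dimension.
human_labels = ["yes", "yes", "no", ABSTAIN]
report = {
    "distribution": label_distribution(human_labels),
    "reliability": pairwise_agreement(human_labels),
    "alignment": model_alignment("yes", human_labels),
}
print(report)
```

Reporting the reliability figure next to the alignment figure lets a reader see whether a low alignment score reflects a model failure or a dimension on which the annotators themselves disagree.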
