Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

ArXiv CS.CV | 2026-06-16 | NEWS | en | Authors: Marta Vallejo, Siwen Wang

Details

Source site: ArXiv CS.CV
Authors: Marta Vallejo, Siwen Wang
Article type: NEWS
Language: en
Publication date: 2026-06-16

Abstract

arXiv:2606.15202v1, announce type: new.

Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments. Eye-tracking data were collected with Pupil Invisible wearable glasses from ten participants, who viewed 33 scene images of environments with varying levels of potential risk. Gaze coordinates were mapped onto the stimulus images to generate population-averaged human gaze heatmaps. In parallel, GPT-4o was prompted through the OpenAI Vision Application Programming Interface (API) to generate spatial predictions of visual attention, which were converted into saliency maps for comparison with the human gaze patterns.
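The pipeline described in the abstract (gaze points mapped onto an image, averaged across participants, then compared with a model-derived saliency map) can be illustrated with a minimal sketch. The abstract does not state the smoothing kernel, the image resolution, or the comparison metric, so everything specific here is an assumption: the isotropic Gaussian with `sigma=2.0`, the Pearson correlation coefficient as the similarity score, the tiny 16x12 grid, and the made-up gaze points. The function names are hypothetical and are not taken from the paper.

```python
import math

def gaze_heatmap(points, width, height, sigma=2.0):
    """Accumulate an isotropic Gaussian around each (x, y) gaze point.

    Assumption: Gaussian smoothing with a fixed sigma in pixels; the
    abstract does not specify how the heatmaps were smoothed.
    """
    heat = [[0.0] * width for _ in range(height)]
    for gx, gy in points:
        for y in range(height):
            for x in range(width):
                d2 = (x - gx) ** 2 + (y - gy) ** 2
                heat[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    return heat

def normalize(heat):
    """Scale a map so its values sum to 1 (returned unchanged if all zero)."""
    total = sum(sum(row) for row in heat)
    if total == 0:
        return [row[:] for row in heat]
    return [[v / total for v in row] for row in heat]

def average_maps(maps):
    """Population average: per-pixel mean of per-participant maps."""
    n = len(maps)
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / n for x in range(width)]
            for y in range(height)]

def pearson_cc(a, b):
    """Pearson correlation between two maps of equal size.

    Assumption: used here as one common saliency similarity score;
    the abstract does not name the metric the authors used.
    """
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    ma, mb = sum(fa) / len(fa), sum(fb) / len(fb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    va = sum((x - ma) ** 2 for x in fa)
    vb = sum((y - mb) ** 2 for y in fb)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

# Made-up gaze points for two hypothetical participants on a 16x12 image.
participants = [[(5, 5), (6, 5)], [(5, 6)]]
human = average_maps(
    [normalize(gaze_heatmap(p, 16, 12)) for p in participants]
)

# Hypothetical model predictions, already reduced to attention points.
model_near = normalize(gaze_heatmap([(5, 5)], 16, 12))
model_far = normalize(gaze_heatmap([(14, 10)], 16, 12))

print("CC near:", pearson_cc(human, model_near))
print("CC far: ", pearson_cc(human, model_far))
```

Normalizing each participant's map before averaging keeps a participant with many recorded gaze samples from dominating the population map; whether the study weighted participants this way is not stated in the abstract.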