OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

arXiv cs.CV | 2026-05-27 | Authors: Qiaomu Miao, Haoyu Wu, Jingyi Xu, Minh Hoai, Dimitris Samaras

Abstract

arXiv:2605.26399v1

Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference. While recent Vision-Language Models (VLMs) offer the exceptional semantic reasoning needed to address gaze-related semantic tasks, their reliance on discrete text generation inherently limits precision in continuous spatial tasks like gaze localization. To bridge this gap, we propose OmniGF, a unified vision-language framework that adapts foundational VLMs for highly scalable multi-person gaze reasoning.