Head-Pose-Aware Visual Speech Recognition with FiLM Modulation

arXiv cs.CV, 2026-06-02. Authors: Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh

Abstract

arXiv:2606.00751v1. Visual Speech Recognition (VSR) aims to recognize speech from visual cues such as lip movements, but its performance is fundamentally limited by viseme ambiguity and by pose-induced variations that introduce geometric distortions and occlusions. Existing approaches mainly rely on linguistic context or implicit invariance, leaving visual representations insufficiently robust under non-frontal views. In this work, we propose a pose-aware phoneme-level framework, termed HP-VSR-ResFiLM, that explicitly incorporates head-pose information into visual feature extraction. The proposed framework adopts a two-stage pipeline consisting of a pose-conditioned visual encoder in Stage 1 and a pretrained NLLB language model in Stage 2 for phoneme-to-text reconstruction.
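
The abstract does not say how the head-pose signal enters the encoder. As a rough illustration only, the sketch below shows generic feature-wise linear modulation (FiLM): a conditioning vector is mapped to one scale (gamma) and one shift (beta) per channel, which are then applied to every frame of a feature sequence. The pose representation (yaw, pitch, roll), the linear projections, the function names, and all weights are assumptions made for this example. It is not the paper's HP-VSR-ResFiLM block, and the residual part implied by the name is not modeled.

```python
def linear(weights, bias, x):
    """Compute y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, pose, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each feature channel using values predicted from pose.

    features: list of frames, each a list of C channel activations
    pose:     [yaw, pitch, roll] (hypothetical conditioning input)
    """
    gamma = linear(w_gamma, b_gamma, pose)  # one scale per channel
    beta = linear(w_beta, b_beta, pose)     # one shift per channel
    return [[g * f + b for f, g, b in zip(frame, gamma, beta)]
            for frame in features]


# Two frames with two channels each; all weights are illustrative.
features = [[1.0, 2.0], [3.0, 4.0]]
pose = [0.5, 0.0, 0.0]  # assumed yaw, pitch, roll

w_gamma = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_gamma = [1.0, 1.0]    # bias of 1 keeps gamma near identity
w_beta = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
b_beta = [0.0, 0.0]

modulated = film(features, pose, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)  # [[2.0, 2.5], [6.0, 4.5]]
```

With zero projection weights, a gamma bias of 1, and a beta bias of 0, the layer reduces to the identity, a common starting point when conditioning is added to an existing encoder.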
