Your Multimodal Speech Model Says I Have a Face for Radio

ArXiv CS.CL | 2026-06-01 | NEWS | en | Authors: Maya K. Nachesa, Vlad Niculae, Vagrant Gautam

Details

Source site: ArXiv CS.CL
Authors: Maya K. Nachesa, Vlad Niculae, Vagrant Gautam
Article type: NEWS
Language: en
Publication date: 2026-06-01

Abstract

arXiv:2605.30472v1 (announce type: new)

As large neural models have become better at language tasks, researchers are increasingly building multimodal and omnimodal models that handle more modalities of data. One example is the extension of speech recognition models to audio-visual data for noise mitigation and multimodal subtitling. While performance and bias have been studied extensively in the single-modality regime, it is unknown how additional modalities affect them, even though such modalities produce biases in humans. We therefore propose the first bias evaluation of multimodal speech recognition: we create videos that pair different faces with the same audio and measure the resulting changes in speech transcription accuracy. We find large quality-of-service differences across mWhisper-Flamingo and Gemini models, with drops of up to 4.05 word error rate points, across self-declared gender, ethnicity, and their intersection.
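The measurement described in the abstract (same audio, different faces, compare transcription accuracy) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the group labels, and the toy transcripts are all hypothetical, and word error rate is computed here with a plain word-level edit distance and reported in percentage points.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER in percentage points over (reference, hypothesis) string pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


def wer_by_group(records):
    """records: iterable of (group_label, reference, hypothesis).

    group_label is a hypothetical tag for the face paired with the audio.
    """
    grouped = defaultdict(list)
    for group, ref, hyp in records:
        grouped[group].append((ref, hyp))
    return {g: corpus_wer(p) for g, p in grouped.items()}


def max_wer_gap(records):
    """Largest difference in WER points between any two groups."""
    scores = wer_by_group(records)
    return max(scores.values()) - min(scores.values())


# Toy example: identical audio (same reference), two different faces.
records = [
    ("face_a", "the cat sat", "the cat sat"),
    ("face_b", "the cat sat", "the bat sat"),
]
gap = max_wer_gap(records)
```

Because the reference transcript is held fixed across groups, any gap in this sketch can only come from differences in the model's output, which mirrors the controlled design the abstract describes.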
