FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

ArXiv CS.CL | 2026-05-27 | NEWS | en | Authors: Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad

Details

Source: ArXiv CS.CL
Authors: Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad
Article type: NEWS
Language: en
Published: 2026-05-27

Abstract

arXiv:2605.27062v1

State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data.
