FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

ArXiv CS.CL | 2026-05-27 | NEWS | en | Authors: Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad

Details

Source site
ArXiv CS.CL
Authors
Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad
Article type
NEWS
Language
en
Publication date
2026-05-27

Abstract

arXiv:2605.27062v1 (announce type: new)

State-of-the-art performance in Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. With considerably fewer speakers (around 11 million) than Brazilian Portuguese (BP, around 200 million speakers), European Portuguese (EP) is overshadowed in currently available large-scale speech data resources, which results in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data.