FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions
Details
- Source site
- ArXiv CS.CL
- Authors
- Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad
- Article type
- NEWS
- Language
- en
- Publication date
- 2026-05-27
Abstract
arXiv:2605.27062v1. State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data.