Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

ArXiv cs.CL, 2026-05-29. Authors: Gus Lathouwers, Lingyun Gao, Catia Cucchiarini, Helmer Strik

Abstract

arXiv:2605.28833v1

Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for child speech remains challenging in low-resource languages due to limited child-specific pre-trained models and highly diverse noise conditions. This study addresses two research questions about the effectiveness of state-of-the-art ASR models on child speech, evaluating nine models from three model families (Whisper, Parakeet, and Wav2Vec2) on two Dutch child speech datasets, JASMIN and DART. Research question 1 examines the performance of ASR models applied to child speech. The fine-tuned Whisper-medium model achieves the best overall performance, with a word error rate (WER) of 5.54% on JASMIN and 70.37% on DART, showing that the noisy DART data are clearly more challenging.
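For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' scoring pipeline: the function name `word_error_rate` is hypothetical, and the abstract does not state what text normalization (casing, punctuation, filled pauses) was applied before scoring, so none is assumed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by whitespace splitting, with no
    normalization. Real evaluations usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented example sentences, not taken from JASMIN or DART.
    print(word_error_rate("de kat zit op de mat", "de kat zit op mat"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% on very noisy recordings, so a figure such as 70.37% does not mean that roughly 30% of words were transcribed correctly.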