WhisperX: Time-Accurate Speech Transcription of Long-Form Audio (Paper)
2023 · Citations: 231
Speech and Audio Processing, Music and Audio Processing, Speech Recognition and Synthesis
Abstract
Figure 1: WhisperX: We present a system for efficient speech transcription of long-form audio with word-level time alignment. The input audio is first segmented with Voice Activity Detection (VAD) and then cut and merged into approximately 30-second input chunks with boundaries that lie on minimally active speech regions. The resulting chunks are then (i) transcribed in parallel with Whisper and (ii) force-aligned with a phone recognition model to produce accurate word-level timestamps at high throughput.
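
The cut-and-merge step in the caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes VAD output is available as (start, end) segments in seconds plus a per-frame speech-probability list. The names `cut_segment`, `merge_segments`, `scores`, and `frame_sec` are hypothetical, and restricting the cut search to the second half of the 30-second window is an assumption made here so that pieces do not become too short.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def cut_segment(seg: Segment, scores: Sequence[float], frame_sec: float,
                max_len: float = 30.0) -> List[Segment]:
    """Split a segment longer than max_len at low-activity frames.

    scores[i] is an assumed per-frame speech probability for the frame
    starting at i * frame_sec. The cut is searched in the second half of
    the max_len window (an assumption of this sketch).
    """
    start, end = seg
    pieces: List[Segment] = []
    while end - start > max_len:
        lo = int((start + max_len / 2) / frame_sec)
        hi = min(int((start + max_len) / frame_sec), len(scores) - 1)
        if hi < lo:
            # No score information available: fall back to a hard cut.
            cut = start + max_len
        else:
            # Pick the frame with the lowest voice activity in the window.
            cut_frame = min(range(lo, hi + 1), key=lambda i: scores[i])
            cut = cut_frame * frame_sec
        pieces.append((start, cut))
        start = cut
    pieces.append((start, end))
    return pieces


def merge_segments(segments: Sequence[Segment],
                   max_len: float = 30.0) -> List[Segment]:
    """Greedily merge consecutive segments while the chunk spans <= max_len."""
    chunks: List[Segment] = []
    for start, end in segments:
        if chunks and end - chunks[-1][0] <= max_len:
            # Extend the current chunk to cover this segment.
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


if __name__ == "__main__":
    # One 50 s speech region with a dip in activity at t = 20 s.
    scores = [1.0] * 50
    scores[20] = 0.0
    print(cut_segment((0.0, 50.0), scores, frame_sec=1.0))
    # Several short regions merged into chunks of at most 30 s.
    print(merge_segments([(0, 5), (6, 12), (14, 29), (31, 40)]))
```

Each resulting chunk would then be padded to 30 seconds and batched for transcription. Because chunk boundaries fall on low-activity frames, words are less likely to be split across chunks.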