WhisperX: Time-Accurate Speech Transcription of Long-Form Audio (Paper)

2023 | Citations: 231

Speech and Audio Processing, Music and Audio Processing, Speech Recognition and Synthesis



Abstract

[Figure 1 diagram labels: Batch; Input audio; <|transcribe|>; Pad to 30s]

Figure 1: WhisperX: We present a system for efficient speech transcription of long-form audio with word-level time alignment. The input audio is first segmented with Voice Activity Detection and then cut & merged into approximately 30-second input chunks with boundaries that lie on minimally active speech regions. The resulting chunks are then: (i) transcribed in parallel with Whisper and (ii) force-aligned with a phone recognition model to produce accurate word-level timestamps at high throughput.
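The cut & merge step in the caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `cut_long` and `merge_segments` and the `max_len` parameter are hypothetical, and long segments are cut at fixed 30-second offsets as a simplification, whereas the caption describes boundaries placed on minimally active speech regions (which would require the per-frame VAD scores).

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of a VAD speech region


def cut_long(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Cut any segment longer than max_len into pieces of at most max_len.

    Simplification (assumption): cuts fall at fixed offsets, not at the
    point of minimal voice activity described in the caption.
    """
    out: List[Segment] = []
    for start, end in segments:
        while end - start > max_len:
            out.append((start, start + max_len))
            start += max_len
        out.append((start, end))
    return out


def merge_segments(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Greedily merge neighbouring segments while the merged span stays <= max_len."""
    chunks: List[Segment] = []
    for start, end in cut_long(sorted(segments), max_len):
        if chunks and end - chunks[-1][0] <= max_len:
            # Extend the current chunk to cover this segment.
            chunks[-1] = (chunks[-1][0], end)
        else:
            # Start a new chunk.
            chunks.append((start, end))
    return chunks


vad_segments = [(0.0, 5.0), (6.0, 20.0), (22.0, 31.0), (33.0, 40.0), (100.0, 170.0)]
chunks = merge_segments(vad_segments)
print(chunks)
```

Each resulting chunk is at most 30 seconds long, so the chunks can be padded to a fixed length and transcribed together as a batch, as the figure labels suggest.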

Authors (3)

Tengda Han

Jaesung Huh

Max Bain
