Extracting deep bottleneck features using stacked auto-encoders (paper)

2013, cited by 279

Speech Recognition and Synthesis, Music and Audio Processing, Speech and Audio Processing

Abstract

In this work, a novel training scheme for generating bottleneck features from deep neural networks is proposed. A stack of denoising auto-encoders is first trained in a layer-wise, unsupervised manner. Afterwards, the bottleneck layer and an additional layer are added and the whole network is fine-tuned to predict target phoneme states. We perform experiments on a Cantonese conversational telephone speech corpus and find that increasing the number of auto-encoders in the network produces more useful features, but requires pre-training, especially when little training data is available. Using additional unlabeled data for pre-training alone yields further gains. Evaluations on larger datasets and on different system setups demonstrate the general applicability of our approach. In terms of word error rate, relative improvements of 9.2% (Cantonese, ML training), 9.3% (Tagalog, BMMI-SAT training), 12% (Tagalog, confusion network combinations with MFCCs), and 8.7% (Switchboard) are achieved.
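
The training scheme outlined in the abstract can be illustrated with a small NumPy sketch: denoising auto-encoders are pre-trained one layer at a time, a bottleneck layer, one further hidden layer and an output layer are then added, and the whole network is fine-tuned on state labels. This is not the authors' implementation. The layer sizes, masking noise, tied weights, sigmoid units, learning rates, the random toy data and all function names (`pretrain_dae`, `forward`, `finetune`, `rand_layer`) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dae(data, n_hidden, noise=0.2, lr=0.1, epochs=50):
    """Train one denoising auto-encoder (assumed: tied weights, linear
    decoder, squared error). Returns the encoder parameters (W, b)."""
    n_in = data.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        x_noisy = data * (rng.random(data.shape) >= noise)  # masking noise
        h = sigmoid(x_noisy @ W + b)
        err = (h @ W.T + c) - data            # reconstruct the clean input
        dh = (err @ W) * h * (1.0 - h)        # backprop into the encoder
        W -= lr * (x_noisy.T @ dh + err.T @ h) / len(data)
        b -= lr * dh.mean(axis=0)
        c -= lr * err.mean(axis=0)
    return W, b

def forward(layers, x):
    """Return activations of every layer; the last entry holds the logits."""
    acts = [x]
    for i, (W, b) in enumerate(layers):
        z = acts[-1] @ W + b
        acts.append(z if i == len(layers) - 1 else sigmoid(z))
    return acts

def finetune(layers, x, y, lr=0.5, epochs=200):
    """Supervised fine-tuning of the whole stack with softmax cross-entropy."""
    for _ in range(epochs):
        acts = forward(layers, x)
        p = np.exp(acts[-1] - acts[-1].max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        delta = p
        delta[np.arange(len(y)), y] -= 1.0
        delta /= len(y)
        for i in reversed(range(len(layers))):
            W, b = layers[i]
            gW, gb = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:  # propagate through the sigmoid of the layer below
                delta = (delta @ W.T) * acts[i] * (1.0 - acts[i])
            layers[i] = (W - lr * gW, b - lr * gb)
    return layers

def rand_layer(n_a, n_b):
    return rng.normal(0.0, 0.1, size=(n_a, n_b)), np.zeros(n_b)

# Toy stand-in for acoustic frames and phoneme-state labels (hypothetical).
n_frames, n_in, n_states = 256, 20, 4
x = rng.normal(size=(n_frames, n_in))
y = (x[:, 0] > 0).astype(int) + 2 * (x[:, 1] > 0).astype(int)

# 1) Layer-wise, unsupervised pre-training of the auto-encoder stack.
layers, h = [], x
for n_hidden in (32, 32):
    W, b = pretrain_dae(h, n_hidden)
    layers.append((W, b))
    h = sigmoid(h @ W + b)

# 2) Add a bottleneck layer, one additional hidden layer and the output layer.
bottleneck_index = len(layers)
layers += [rand_layer(32, 8), rand_layer(8, 32), rand_layer(32, n_states)]

# 3) Fine-tune the whole network to predict the state labels.
layers = finetune(layers, x, y)

# 4) Bottleneck features are the activations of the bottleneck layer.
feats = forward(layers, x)[bottleneck_index + 1]
print(feats.shape)  # (256, 8)
```

In a recognition system the resulting `feats` would replace or complement conventional acoustic features as input to the acoustic model; the sketch stops at feature extraction.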

Authors

Alex Waibel

Florian Metze

Yajie Miao

Jonas Gehring