Exploring convolutional neural network structures and optimization techniques for speech recognition (paper)

2013, cited by 387
Speech Recognition and Synthesis, Speech and Audio Processing, Music and Audio Processing

Abstract

Recently, convolutional neural networks (CNNs) have been shown to outperform the standard fully connected deep neural networks within the hybrid deep neural network / hidden Markov model (DNN/HMM) framework on the phone recognition task. In this paper, we extend the earlier basic form of the CNN and explore it in multiple ways. We first investigate several CNN architectures, including full and limited weight sharing, convolution along frequency and time axes, and stacking of several convolution layers. We then develop a novel weighted softmax pooling layer so that the size in the pooling layer can be automatically learned. Further, we evaluate the effect of CNN pretraining, which is achieved by using a convolutional version of the RBM. We show that all CNN architectures we have investigated outperform the earlier basic form of the DNN on both the phone recognition and large vocabulary speech recognition tasks. The architecture with limited weight sharing provides additional gains over the full weight sharing architecture. The softmax pooling layer performs as well as the best CNN with the manually tuned fixed-pooling size, and has a potential for further improvement. Finally, we show that CNN pretraining produces significantly better results on a large vocabulary speech recognition task.
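The abstract does not spell out the weighted softmax pooling layer, so the following is only a minimal sketch of one way such a layer could work, not the authors' formulation. It assumes that each position in a maximal pooling window gets a learnable logit, and that the pooled output is the softmax-weighted sum of the activations in that window. The names `softmax_weighted_pool` and `position_logits` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_weighted_pool(activations, position_logits):
    """Pool one window of activations as a softmax-weighted sum.

    activations: feature-map values inside the maximal pooling window.
    position_logits: hypothetical learnable parameters, one per position
        (an assumption of this sketch, not taken from the paper).
    """
    if len(activations) != len(position_logits):
        raise ValueError("need one logit per window position")
    weights = softmax(position_logits)
    return sum(w * a for w, a in zip(weights, activations))


window = [0.2, 1.0, 0.4, 0.1]

# Equal logits give equal weights, i.e. average pooling over the full window.
uniform = softmax_weighted_pool(window, [0.0, 0.0, 0.0, 0.0])

# One dominant logit concentrates the weight on a single position,
# which behaves like a pooling window of size one.
narrow = softmax_weighted_pool(window, [-50.0, 50.0, -50.0, -50.0])
```

Under these assumptions, training the logits by gradient descent would let the layer move between wide and narrow effective pooling sizes instead of fixing the size by hand, which is the behaviour the abstract attributes to the softmax pooling layer.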