Exploring convolutional neural network structures and optimization techniques for speech recognition (paper)
Abstract
Recently, convolutional neural networks (CNNs) have been shown to outperform the standard fully connected deep neural networks within the hybrid deep neural network / hidden Markov model (DNN/HMM) framework on the phone recognition task. In this paper, we extend the earlier basic form of the CNN and explore it in multiple ways. We first investigate several CNN architectures, including full and limited weight sharing, convolution along frequency and time axes, and stacking of several convolution layers. We then develop a novel weighted softmax pooling layer so that the size in the pooling layer can be automatically learned. Further, we evaluate the effect of CNN pretraining, which is achieved by using a convolutional version of the RBM. We show that all CNN architectures we have investigated outperform the earlier basic form of the DNN on both the phone recognition and large vocabulary speech recognition tasks. The architecture with limited weight sharing provides additional gains over the full weight sharing architecture. The softmax pooling layer performs as well as the best CNN with the manually tuned fixed-pooling size, and has a potential for further improvement. Finally, we show that CNN pretraining produces significantly better results on a large vocabulary speech recognition task.
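The abstract does not give the formulation of the weighted softmax pooling layer, so the following is only an illustrative sketch under stated assumptions, not the paper's method. It assumes that each pooled output is a weighted sum of the activations in a window, with the weights obtained as a softmax over learnable scores (here the hypothetical parameter `logits`). Under that assumption, uniform scores average the whole window, while sharply peaked scores attend to a single position, which is one way the effective pooling size could be learned rather than fixed by hand. Fixed-size max pooling is shown for comparison; all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def windows(h, size):
    """Split activations h into non-overlapping windows of the given size."""
    return [h[i:i + size] for i in range(0, len(h) - size + 1, size)]


def max_pool(h, size):
    """Fixed-size max pooling: the pooling size is a manually tuned constant."""
    return [max(win) for win in windows(h, size)]


def weighted_softmax_pool(h, logits):
    """Assumed form of softmax-weighted pooling (not taken from the paper).

    Each window is reduced to a weighted sum of its activations. The weights
    are softmax(logits); in a trained network the logits would be learned.
    """
    weights = softmax(logits)
    return [
        sum(w * x for w, x in zip(weights, win))
        for win in windows(h, len(logits))
    ]


if __name__ == "__main__":
    # Toy activations along the frequency axis (made-up numbers).
    h = [1.0, 3.0, 2.0, 0.0, 5.0, 4.0]
    print(max_pool(h, 3))
    print(weighted_softmax_pool(h, [0.0, 0.0, 0.0]))   # uniform weights
    print(weighted_softmax_pool(h, [0.0, 50.0, 0.0]))  # peaked weights
```

In this sketch the window length is still the length of `logits`; what changes with the scores is how many positions in the window effectively contribute to the output.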