A sticky HDP-HMM with application to speaker diarization (paper)

2011 | Project Euclid (Cornell University) | Cited by 228
Bayesian Methods and Mixture Models; Speech Recognition and Synthesis; Gaussian Processes and Bayesian Inference

Abstract

We consider the problem of speaker diarization, the problem of segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. The problem is rendered particularly difficult by the fact that we are not allowed to assume knowledge of the number of people participating in the meeting. To address this problem, we take a Bayesian nonparametric approach to speaker diarization that builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006) 1566–1581]. Although the basic HDP-HMM tends to over-segment the audio data, creating redundant states and rapidly switching among them, we describe an augmented HDP-HMM that provides effective control over the switching rate. We also show that this augmentation makes it possible to treat emission distributions nonparametrically. To scale the resulting architecture to realistic diarization problems, we develop a sampling algorithm that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence, greatly improving mixing rates. Working with a benchmark NIST data set, we show that our Bayesian nonparametric architecture yields state-of-the-art speaker diarization results.
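
As a rough illustration of how a prior can control the switching rate under a truncated approximation, the sketch below draws a finite transition matrix in which each state's self-transition receives extra prior mass. It is a minimal sketch under stated assumptions, not the authors' sampler: the function name `sample_sticky_transitions`, the truncation level `L`, and the hyperparameter names `gamma`, `alpha`, and `kappa` are all illustrative and do not appear in the abstract.

```python
import numpy as np

def sample_sticky_transitions(L, gamma, alpha, kappa, rng):
    """Draw an L-state transition matrix with a self-transition bias.

    All names and hyperparameters here are hypothetical, chosen for illustration.
    """
    # Finite approximation of the shared state weights:
    # beta ~ Dirichlet(gamma/L, ..., gamma/L)
    beta = rng.dirichlet(np.full(L, gamma / L))
    # Row j has concentration alpha * beta + kappa * e_j, so kappa > 0 shifts
    # prior mass toward staying in state j (fewer rapid switches).
    # The tiny constant only guards against exactly-zero concentrations.
    conc = alpha * beta[None, :] + kappa * np.eye(L) + 1e-8
    pi = np.vstack([rng.dirichlet(row) for row in conc])
    return beta, pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    _, pi_plain = sample_sticky_transitions(10, 5.0, 5.0, 0.0, rng)
    _, pi_sticky = sample_sticky_transitions(10, 5.0, 5.0, 50.0, rng)
    # Average self-transition probability with and without the bias.
    print(np.trace(pi_plain) / 10, np.trace(pi_sticky) / 10)
```

Setting `kappa` to zero removes the bias, so every row is centered on the shared weights `beta`; larger values concentrate each row on its own diagonal entry, which is the kind of persistence that discourages rapid switching among redundant states.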