Problem statement and motivation
An unsupervised representation for speech, i.e., one trained directly on large amounts of unlabelled speech recordings and able to disentangle the main factors underlying speech variability, would have a major impact on many speech processing tasks, for the following reasons. Labelling speech data requires manual transcription, which is very expensive: transcribing an audio recording into time-aligned text requires between 40 and 100 hours of work for each hour of recording (Seifart, Evans, Hammarström & Levinson, 2018). Collecting a reasonable amount of labelled data can even be infeasible for languages where there are simply not enough transcribers available. On the other hand, large collections of unannotated recordings are readily available, even for rare and endangered languages; this disproportion between the ease of collecting audio recordings and their transcription cost has been termed the "transcription bottleneck" (Seifart et al., 2018). Leveraging large amounts of untranscribed speech through unsupervised representations would extend speech recognition to tasks that are not currently feasible, such as recognition of low-resource languages. The major sources of variability in speech are often described as phonetic, speaker and channel related, because of their impact on speech recognition (Lippmann, 1997), speaker recognition (Kenny, Boulianne, Ouellet & Dumouchel, 2007b) and emotion detection (Cummins, Epps, Sethu & Krajewski, 2014).
These claims are supported by empirical studies of speech variability in the spectral domain, such as Kajarekar, Malayath & Hermansky (1999), which concludes that phonemes and phonetic context account for 59.7% of total variability, while speaker and channel variability accounts for 40.3%. In speech-to-text, spoken term retrieval, or language recognition, speaker variation is a nuisance factor that degrades performance, while for speaker diarization or text-independent speaker verification, it is the phonetic content that must be ignored or compensated for. All these tasks would benefit from a representation able to disentangle speaker and phonetic variations. Feeding the downstream classifier only the relevant part of the representation frees it from dealing with unrelated variability, allows it to be simpler, and reduces its annotated data requirements. Unsupervised representation learning for speech has received less attention than supervised representation learning, with most previous work relying on generative models that represent underlying factors of variation but do not produce disentangled, interpretable representations. Modified objective functions have been used to encourage disentangling, such as β-VAE (Higgins, Matthey, Pal, Burgess, Glorot, Botvinick, Mohamed & Lerchner, 2017) and mutual information maximization (Chen, Duan, Houthooft, Schulman, Sutskever & Abbeel, 2016; Phuong, Welling, Kushman, Tomioka & Nowozin, 2018). However, as noted in Locatello, Bauer, Lucic, Rätsch, Gelly, Schölkopf & Bachem (2018), purely unsupervised learning of disentangled representations is not possible without inductive biases on both the model and the data. Some recent work (Chorowski, Weiss, Bengio & van den Oord, 2019; Hsu, Zhang & Glass, 2017b; Li & Mandt, 2018) has used an inductive bias in the form of a binary opposition between frame-level and utterance-level representations.
K-means
Given N objects (observations) represented by features of dimension D, K-means clusters the objects into K groups such that observations within a group are closer to each other than to observations in other groups. To evaluate closeness, conventional K-means relies on the Euclidean distance: it groups observations so that the sum of squared errors between the empirical mean of each cluster and the observations in that cluster is minimized. The resulting clustering can be used for classification, data visualization, or as an initialization for more expensive clustering algorithms. The K-means algorithm has several advantages: it is simple and scalable, easy to implement, and works well for a variety of applications (Kulis & Jordan, 2012). Its time complexity is linear in N, D and K. It also has well-known limitations: it can only detect compact hyperspherical clusters, the number of clusters must be specified in advance, each observation is assigned to a single cluster with a hard decision, and it converges only to a local minimum of its objective function. Several extensions to K-means have been proposed to overcome these limitations: selection of the number of clusters with the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) (Pelleg & Moore, 2000), use of other distance measures such as Mahalanobis, L1 or Itakura-Saito (Celebi, Kingravi & Vela, 2013), and a large spectrum of initialization methods (Celebi et al., 2013). Spectral clustering for K-means (Zha, He, Ding & Simon, 2001) and kernel K-means (Schölkopf, Smola & Müller, 1996) were introduced mainly to allow more arbitrarily shaped clusters, and the two can be shown to be equivalent (Dhillon, Guan & Kulis, 2005). Spectral clustering has been used in speech separation (Hershey, Chen, Le Roux & Watanabe, 2016). As an unsupervised learning method in speech, K-means had some early successes (Astrahan, 1970), but it was quickly replaced with a better-performing approach, Gaussian Mixture Models (GMMs).
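As a concrete illustration, the following is a minimal sketch of Lloyd's algorithm, the standard K-means procedure, assuming the N observations are stacked in an N x D NumPy array; the function name and defaults are illustrative rather than taken from any of the cited works. Each iteration alternates a hard assignment step and a mean-update step, monotonically decreasing the sum-of-squared-errors objective until a local minimum is reached.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm: alternately assign each observation to
    its nearest mean, then recompute each mean as the empirical mean of
    its cluster, decreasing the sum of squared errors at every step."""
    rng = np.random.default_rng(seed)
    # Initialize the K means with distinct observations chosen at random.
    means = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: hard decision by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: empirical mean of each cluster (keep the old mean
        # if a cluster received no observations this iteration).
        new_means = np.array([X[labels == k].mean(axis=0)
                              if np.any(labels == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):
            break  # converged, to a local minimum of the objective
        means = new_means
    return means, labels
```

A call such as means, labels = kmeans(features, K=64) would then cluster a matrix of feature frames; note how the argmin over Euclidean distances makes explicit both the hard-assignment and hyperspherical-cluster limitations discussed above.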
Gaussian Mixture Model
The Gaussian Mixture Model (GMM) is a generative model: observations are assumed to be sampled from a mixture of Gaussian distributions, in which each Gaussian component is specified by a mean and a covariance. The generative process first samples a component index (from a prior distribution over component indices), then samples an observation from the associated Gaussian distribution. To estimate the model parameters and the per-observation component assignments that maximize the overall likelihood of the observations, the most commonly used algorithm is Expectation-Maximization (EM). EM for GMMs has been shown to be a probabilistic generalization of K-means (Welling, 2009; Kulis & Jordan, 2012): clustering with a GMM is thus closely related to K-means, but differs by replacing hard cluster assignments with probabilities and by using a covariance-based distance measure. In speech, GMMs have been used as part of Hidden Markov Models (HMMs) in the classical supervised GMM-HMM speech recognition framework (Deng & Jaitly, 2016). As an unsupervised learning method, their most common use is in speaker recognition, as a Universal Background Model (UBM) for i-vector extraction (Dehak, Kenny, Dehak, Dumouchel & Ouellet, 2011).
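For comparison with the K-means sketch above, the sketch below fits a GMM by EM using scikit-learn's GaussianMixture; the toy data and the choice of 8 components are hypothetical, for illustration only. The posterior responsibilities returned by predict_proba are the soft counterparts of K-means' hard assignments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data standing in for acoustic feature frames: an (N, D) array.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))

# Fit an 8-component GMM with full covariances via EM
# (8 components is an arbitrary illustrative choice).
gmm = GaussianMixture(n_components=8, covariance_type='full',
                      max_iter=100, random_state=0)
gmm.fit(X)

# E-step responsibilities: the posterior probability of each component
# given each observation -- soft assignments, unlike K-means.
resp = gmm.predict_proba(X)       # shape (N, 8)
hard = resp.argmax(axis=1)        # collapses back to K-means-style labels
```

Taking the argmax of the responsibilities recovers a hard clustering, which makes the relationship between the two methods concrete: K-means is the limiting case of hard assignments and shared spherical covariances.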