A Gaussian Mixture Model (GMM) is a mixture of \(K\) multivariate Gaussian distributions. In order to sample from a GMM, one first samples the component index \(k \in \{1,\dots,K\}\) with prior probability \(\pi_k\), and then samples the vector \(\bx \in \mathbb{R}^d\) from the \(k\)-th Gaussian distribution \(p(\bx|\mu_k,\Sigma_k)\). Here \(\mu_k\) and \(\Sigma_k\) are respectively the mean and covariance of the distribution. The GMM is completely specified by the parameters \(\Theta=\{\pi_k,\mu_k,\Sigma_k;\ k = 1,\dots,K\}\).
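To make the generative process concrete, the following self-contained C sketch draws one sample from a GMM with diagonal covariance matrices (the case handled by the VLFeat implementation, as noted below). It is purely illustrative and not part of the VLFeat API; the function name, the row-major array layout, and the use of rand() with Box-Muller are assumptions of this sketch.
```c
#include <math.h>
#include <stdlib.h>

/* Illustrative only (not part of the VLFeat API): draw one sample x from a
   GMM with diagonal covariances. priors has K entries; means and covariances
   are K x d arrays stored row-major, with covariances holding the diagonals. */
void sample_gmm_point (double * x, int K, int d,
                       double const * priors,
                       double const * means,
                       double const * covariances)
{
  /* sample the component index k with prior probability priors[k] */
  double u = (double) rand() / RAND_MAX ;
  int k = 0 ;
  double cumulative = priors[0] ;
  while (k + 1 < K && u > cumulative) { cumulative += priors[++k] ; }

  /* sample x from the k-th Gaussian: for a diagonal covariance each
     coordinate is an independent univariate Gaussian (Box-Muller) */
  for (int j = 0 ; j < d ; ++j) {
    double u1 = ((double) rand() + 1.0) / ((double) RAND_MAX + 1.0) ;
    double u2 = ((double) rand() + 1.0) / ((double) RAND_MAX + 1.0) ;
    double z = sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979323846 * u2) ;
    x[j] = means[k * d + j] + sqrt(covariances[k * d + j]) * z ;
  }
}
```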
The density \(p(\bx|\Theta)\) induced on the training data is obtained by marginalizing out the component selector \(k\):
\[ p(\bx|\Theta) = \sum_{k=1}^{K} \pi_k p(\bx|\mu_k,\Sigma_k), \qquad p(\bx|\mu_k,\Sigma_k) = \frac{1}{\sqrt{(2\pi)^d\det\Sigma_k}} \exp\left[ -\frac{1}{2} (\bx-\mu_k)^\top\Sigma_k^{-1}(\bx-\mu_k) \right]. \]
Learning a GMM to fit a dataset \(X=(\bx_1, \dots, \bx_n)\) is usually done by maximizing the log-likelihood of the data:
\[ \ell(\Theta;X) = E_{\bx\sim\hat p} [ \log p(\bx|\Theta) ] = \frac{1}{n}\sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k p(\bx_i|\mu_k, \Sigma_k) \]
where \(\hat p\) is the empirical distribution of the data. An algorithm to solve this problem is introduced next.
Learning a GMM by expectation maximization
The direct maximization of the log-likelihood function of a GMM is difficult because the assignments of points to Gaussian modes are not observable and, as such, must be treated as latent variables.
Usually, GMMs are learned by using the Expectation Maximization (EM) algorithm [6]. Consider, in general, the problem of maximum likelihood estimation of a distribution \(p(x|\Theta) = \int p(x,h|\Theta)\,dh\), where \(x\) is a measurement, \(h\) is a latent variable, and \(\Theta\) are the model parameters. By introducing an auxiliary distribution \(q(h|x)\) on the latent variable, one can use Jensen's inequality to obtain the following lower bound on the log-likelihood:
\begin{align*} \ell(\Theta;X) = E_{x\sim\hat p} \log p(x|\Theta) &= E_{x\sim\hat p} \log \int p(x,h|\Theta) \,dh \\ &= E_{x\sim\hat p} \log \int \frac{p(x,h|\Theta)}{q(h|x)} q(h|x)\,dh \\ &\geq E_{x\sim\hat p} \int q(h|x) \log \frac{p(x,h|\Theta)}{q(h|x)}\,dh \\ &= E_{(x,h) \sim q(h|x) \hat p(x)} \log p(x,h|\Theta) - E_{(x,h) \sim q(h|x) \hat p(x)} \log q(h|x) \end{align*}
The first term of the last expression is the log-likelihood of the model where both \(x\) and \(h\) are observed and jointly distributed as \(q(h|x)\hat p(x)\); the second term is the average entropy of the latent variable, which does not depend on \(\Theta\). This lower bound is maximized and becomes tight by setting \(q(h|x) = p(h|x,\Theta)\) to be the posterior distribution on the latent variable \(h\) (given the current estimate of the parameters \(\Theta\)). In fact:
\[ E_{x \sim \hat p} \log p(x|\Theta) = E_{(x,h) \sim p(h|x,\Theta) \hat p(x)}\left[ \log \frac{p(x,h|\Theta)}{p(h|x,\Theta)} \right] = E_{(x,h) \sim p(h|x,\Theta) \hat p(x)} [ \log p(x|\Theta) ] = \ell(\Theta;X). \]
EM alternates between updating the latent variable auxiliary distribution \(q(h|x) = p(h|x,\Theta_t)\) given the current estimate of the parameters \(\Theta_t\) (expectation step), and updating the model parameters \(\Theta_{t+1}\) by maximizing the log-likelihood lower bound derived above (maximization step). The simplification is that in the maximization step both \(x\) and \(h\) are now ``observed'' quantities. This procedure converges to a local optimum of the model log-likelihood.
Expectation step
In the case of a GMM, the latent variables are the point-to-cluster assignments \(k_i, i=1,\dots,n\), one for each of the \(n\) data points. The auxiliary distribution \(q(k_i|\bx_i) = q_{ik}\) is a matrix with \(n \times K\) entries. Each row \(q_{i,:}\) can be thought of as a vector of soft assignments of the data point \(\bx_i\) to each of the Gaussian modes. Setting \(q_{ik} = p(k_i | \bx_i, \Theta)\) yields
\[ q_{ik} = \frac {\pi_k p(\bx_i|\mu_k,\Sigma_k)} {\sum_{l=1}^K \pi_l p(\bx_i|\mu_l,\Sigma_l)} \]
where the Gaussian density \(p(\bx_i|\mu_k,\Sigma_k)\) was given above.
One important point to keep in mind when these probabilities are computed is that the Gaussian densities may attain very low values and underflow in a vanilla implementation; it is therefore advisable to work in the log domain. Furthermore, the VLFeat GMM implementation restricts the covariance matrices to be diagonal. In this case, the determinant of \(\Sigma_k\) reduces to the product of its diagonal elements (so that \(\log\det\Sigma_k\) is the sum of their logarithms), and the inverse of \(\Sigma_k\) is obtained by inverting the elements on the diagonal of the covariance matrix.
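As an illustration of these points, the following C sketch computes the responsibilities \(q_{ik}\) for a single data point with diagonal covariances, working in the log domain and normalizing with the log-sum-exp trick to avoid underflow. It is a re-implementation of the formula above for exposition, not the actual code of gmm.h; the function name and the array layouts are assumptions of this sketch.
```c
#include <math.h>

/* Illustrative sketch (not the actual gmm.h code): compute the
   responsibilities q[k] = p(k | x, Theta) for one point x of dimension d,
   for a GMM with K components and diagonal covariances. means and
   covariances are K x d row-major arrays; priors has K entries. */
void compute_responsibilities (double * q, int K, int d,
                               double const * x,
                               double const * priors,
                               double const * means,
                               double const * covariances)
{
  double maxLog = -HUGE_VAL ;

  /* log(pi_k) + log N(x | mu_k, Sigma_k), computed in the log domain:
     for a diagonal covariance, log det Sigma_k is the sum of the logs of
     the diagonal entries and the quadratic form is a weighted sum of squares */
  for (int k = 0 ; k < K ; ++k) {
    double logDet = 0.0, quad = 0.0 ;
    for (int j = 0 ; j < d ; ++j) {
      double sigma2 = covariances[k * d + j] ;
      double diff = x[j] - means[k * d + j] ;
      logDet += log(sigma2) ;
      quad += diff * diff / sigma2 ;
    }
    q[k] = log(priors[k])
         - 0.5 * (d * log(2.0 * 3.14159265358979323846) + logDet + quad) ;
    if (q[k] > maxLog) maxLog = q[k] ;
  }

  /* normalize with the log-sum-exp trick: subtracting the maximum before
     exponentiating prevents underflow of the individual terms */
  double sum = 0.0 ;
  for (int k = 0 ; k < K ; ++k) { q[k] = exp(q[k] - maxLog) ; sum += q[k] ; }
  for (int k = 0 ; k < K ; ++k) { q[k] /= sum ; }
}
```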
Maximization step
The M step estimates the parameters of the Gaussian mixture components and the prior probabilities \(\pi_k\) given the auxiliary distribution on the point-to-cluster assignments computed in the E step. Since all the variables are now ``observed'', the estimate is quite simple. For example, the mean \(\mu_k\) of a Gaussian mode is obtained as the mean of the data points assigned to it (accounting for the strength of the soft assignments). The other quantities are obtained in a similar manner, yielding:
\begin{align*} \mu_k &= \frac{\sum_{i=1}^n q_{ik} \bx_{i}}{\sum_{i=1}^n q_{ik}}, \\ \Sigma_k &= \frac{\sum_{i=1}^n q_{ik} (\bx_{i} - \mu_{k})(\bx_{i} - \mu_{k})^\top}{\sum_{i=1}^n q_{ik}}, \\ \pi_k &= \frac{\sum_{i=1}^n q_{ik}}{\sum_{i=1}^n \sum_{l=1}^K q_{il}}. \end{align*}
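The following C sketch implements these updates for diagonal covariance matrices (only the diagonal of \(\Sigma_k\) is accumulated). As before, it is an illustration of the equations rather than the code of gmm.h; the function name and the row-major array layouts are assumptions of this sketch.
```c
#include <stdlib.h>

/* Illustrative sketch (not the actual gmm.h code): M-step updates for a
   diagonal-covariance GMM. X is n x d row-major, q is n x K row-major,
   priors has K entries, means and covariances are K x d row-major. */
void maximization_step (double * priors, double * means, double * covariances,
                        double const * X, double const * q,
                        int n, int K, int d)
{
  double * mass = calloc(K, sizeof(double)) ; /* sum_i q_{ik} per component */

  for (int k = 0 ; k < K ; ++k) {
    for (int j = 0 ; j < d ; ++j) {
      means[k * d + j] = 0.0 ;
      covariances[k * d + j] = 0.0 ;
    }
  }

  /* accumulate soft counts and weighted sums of the data points */
  for (int i = 0 ; i < n ; ++i) {
    for (int k = 0 ; k < K ; ++k) {
      mass[k] += q[i * K + k] ;
      for (int j = 0 ; j < d ; ++j) {
        means[k * d + j] += q[i * K + k] * X[i * d + j] ;
      }
    }
  }
  /* a real implementation should guard against components with zero mass */
  for (int k = 0 ; k < K ; ++k) {
    for (int j = 0 ; j < d ; ++j) { means[k * d + j] /= mass[k] ; }
  }

  /* accumulate weighted squared deviations for the diagonal covariances */
  for (int i = 0 ; i < n ; ++i) {
    for (int k = 0 ; k < K ; ++k) {
      for (int j = 0 ; j < d ; ++j) {
        double diff = X[i * d + j] - means[k * d + j] ;
        covariances[k * d + j] += q[i * K + k] * diff * diff ;
      }
    }
  }
  for (int k = 0 ; k < K ; ++k) {
    for (int j = 0 ; j < d ; ++j) { covariances[k * d + j] /= mass[k] ; }
    priors[k] = mass[k] / n ; /* sum_i sum_l q_{il} = n, since rows of q sum to one */
  }

  free(mass) ;
}
```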
Initialization algorithms
The EM algorithm is a local optimization method. As such, the quality of the solution strongly depends on the quality of the initial values of the parameters (i.e. of the locations and shapes of the Gaussian modes).
gmm.h supports the following cluster initialization algorithms:
- Random data points. (vl_gmm_init_with_rand_data) This method sets the means of the modes by sampling a corresponding number of data points at random, sets the covariance matrices of all the modes to the covariance of the entire dataset, and sets the prior probabilities of the Gaussian modes to be uniform. This initialization method is the fastest and simplest, but also the one most likely to end up in a bad local minimum.
- KMeans initialization (vl_gmm_init_with_kmeans) This method uses KMeans to pre-cluster the points. It then sets the means and covariances of the Gaussian distributions to the sample means and covariances of each KMeans cluster, and sets the prior probabilities to be proportional to the mass of each cluster. In order to use this initialization method, a user can specify an instance of VlKMeans by using the function vl_gmm_set_kmeans_init_object, or let VlGMM create one automatically.
Alternatively, one can manually specify a starting point (vl_gmm_set_priors, vl_gmm_set_means, vl_gmm_set_covariances).
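For reference, a minimal usage sketch of the C API is given below. It assumes the interface summarized in this page (vl_gmm_new, vl_gmm_set_initialization, vl_gmm_set_max_num_iterations, vl_gmm_cluster, the vl_gmm_get_* accessors, and vl_gmm_delete); consult the gmm.h API reference for the exact signatures and options.
```c
#include <vl/gmm.h>

/* Minimal usage sketch (check the gmm.h API reference for exact signatures):
   fit a GMM with numClusters components to numData points of dimension
   dimension, stored contiguously in data (float, one point per row). */
void fit_gmm (float const * data, vl_size dimension, vl_size numData,
              vl_size numClusters)
{
  VlGMM * gmm = vl_gmm_new (VL_TYPE_FLOAT, dimension, numClusters) ;

  /* choose the initialization strategy; VlGMMKMeans corresponds to the
     KMeans initialization described above (VlGMMRand to random data points) */
  vl_gmm_set_initialization (gmm, VlGMMKMeans) ;
  vl_gmm_set_max_num_iterations (gmm, 100) ;

  /* run EM; the return value is the achieved log-likelihood lower bound */
  double loglikelihood = vl_gmm_cluster (gmm, data, numData) ;

  /* the estimated parameters Theta = {pi_k, mu_k, Sigma_k} */
  float const * priors = vl_gmm_get_priors (gmm) ;
  float const * means = vl_gmm_get_means (gmm) ;
  float const * covariances = vl_gmm_get_covariances (gmm) ;

  /* ... use the model ... */
  (void) loglikelihood ; (void) priors ; (void) means ; (void) covariances ;

  vl_gmm_delete (gmm) ;
}
```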