


Emphasis Reparameterization

This chapter illuminates certain fundamental aspects of the nature of normal (Gaussian) mixtures. Thinking of each mixture component as a class, we focus on the corresponding a posteriori class probability functions. It is shown that the relationship between these functions and the mixture's parameters is highly degenerate - and that the precise nature of this degeneracy leads to somewhat unusual and counter-intuitive behavior. Even complete knowledge of a mixture's a posteriori class behavior reveals essentially nothing of its absolute nature, i.e. mean locations and covariance norms. Consequently a mixture whose means are located in a small ball anywhere in space can project arbitrary class structure everywhere in space.

The well-known expectation maximization (EM) algorithm for Maximum Likelihood (ML) optimization may be thought of as a reparameterization of the problem in which the search takes place over the space of sample point weights. Motivated by EM we characterize the expressive power of similar reparameterizations, where the objective is instead to maximize the a posteriori likelihood of a labeled training set. This is relevant to, and a generalization of, a common heuristic in machine learning in which one increases the weight of a mistake in order to improve classification accuracy. We prove that EM-style reparameterization is not capable of expressing arbitrary a posteriori behavior, and is therefore incapable of expressing some solutions. However a slightly different reparameterization is presented which is almost always fully expressive - a fact proven by exploiting the degeneracy described above.

Introduction

A normal mixture is a finite stochastic combination of multivariate normal (Gaussian) densities. That is, a probability density function $p$ on ${\mathbb{R}}^d$ of the form:


\begin{displaymath}
p(x) = \sum_{i=1}^k m_i N_{\Sigma_i,\mu_i}(x)
\end{displaymath}

where $m_1,\ldots,m_k \geq 0$ with $\sum_{i=1}^k m_i = 1$, and $N_{\Sigma_i,\mu_i}(x)$ denotes the normal density with mean $\mu_i$ and covariance $\Sigma_i$. Each constituent normal density is referred to as a component of the mixture, and $m_1,\ldots,m_k$ are the mixing coefficients.

Normal mixtures have proven useful in several areas including pattern recognition [DH73] and speech recognition [RJLS85,Bro87,HAJ90] - along with vector quantization and many others. Each component of the mixture is thought of as corresponding to some class denoted $\omega_1,\ldots,\omega_k$, and the a posteriori class probabilities $p(\omega_1\vert x),\ldots,p(\omega_k\vert x)$ are used to effect classification.

This chapter considers certain aspects of the relationship between the two faces of a normal mixture, i.e. the mixture itself versus the a posteriori class functions it induces. Section 5.2 shows that this relationship is not one-to-one and exposes the considerable degeneracy wherein many distinct normal mixtures induce the same class functions.

A positive consequence of this degeneracy is that one can search the entire space of class functions without considering all possible mixtures. This is relevant to problems whose solution depends only on the mixture's second face, i.e. its induced class functions. An example of such a problem is that of maximizing the a posteriori class probability of a labeled training set. Here the objective is to predict the correct labels, not to model the observation vectors themselves.

Section 5.3 shows that such search problems may almost always be carried out using reparameterizations we refer to as emphasis methods, motivated by the simple and intuitive expectation maximization (EM) algorithm [BE67,DLR77,RW84] for unsupervised maximum likelihood (ML) parameter estimation. The reparameterized problem cannot express an arbitrary mixture, but because of the degeneracy it can nevertheless express a mixture that induces optimal class functions with respect to the prediction of training labels. To clarify these problems and their relationship to the notion of emphasis we begin with a review.

Given an unlabeled training set the EM algorithm iteratively improves (in the ML sense) any mixture - unless it is already locally optimal. The improved means and covariances are given by a simple weighted averaging computation. Given data values $s_1,\ldots,s_n$ and a normal mixture $M$, one begins by computing $p(\omega_i\vert s_j,M)$ for all $1 \leq i \leq k$, $1 \leq j \leq n$. Conceptually the result is a $k \times n$ table of nonnegative values. The rows of this table are then normalized so that each sums to one; each row then corresponds to a convex combination of the samples, and the entries along a row are thought of as weights attributed to the samples. Each improved mean vector $\mu_i$ is merely the corresponding weighted average (convex combination) of the sample vectors. Each improved covariance is the corresponding convex combination of the outer products $(s_j - \mu_i) (s_j - \mu_i)^t$. The improved mixing parameters are obtained by normalizing the vector of row sums as they stood prior to the row normalization. This process is repeated in an elegant cycle, i.e. each table induces a mixture that gives rise to a new table, and so on.
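
To make the table-driven cycle concrete, here is a minimal sketch of one such iteration in Python/numpy (the function and variable names are ours, and scipy is assumed for the density evaluations; it illustrates the computation just described, not the chapter's implementation):

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def em_step(samples, means, covs, mix):
    """One EM iteration for a k-component normal mixture.
    samples: (n, d); means: (k, d); covs: (k, d, d); mix: (k,)."""
    n, _ = samples.shape
    k = len(mix)

    # E-step: the k x n table of a posteriori probabilities p(omega_i | s_j, M).
    table = np.empty((k, n))
    for i in range(k):
        table[i] = mix[i] * multivariate_normal.pdf(samples, means[i], covs[i])
    table /= table.sum(axis=0, keepdims=True)     # columns now sum to one

    # Row sums, taken before the rows are normalized: they yield the mixing weights.
    row_sums = table.sum(axis=1)
    new_mix = row_sums / row_sums.sum()

    # Normalize each row so it is a convex combination of the samples.
    weights = table / row_sums[:, None]

    # M-step: weighted means and covariances.
    new_means = weights @ samples
    new_covs = np.empty_like(covs)
    for i in range(k):
        centered = samples - new_means[i]
        new_covs[i] = (weights[i][:, None] * centered).T @ centered
    return new_means, new_covs, new_mix
\end{verbatim}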

We view this table as reparameterizing the problem of searching for an ML mixture. The point is that one might instead have used a direct parameterization consisting of the individual means, covariance matrices, and mixing parameters - and employed standard constrained optimization methods from numerical analysis. We loosely say that such a table-driven scheme is an emphasis reparameterization.

Only a locally optimal solution is found through EM but it is important to realize that the global optimum is a stationary point of the iteration. This means that it can be expressed via emphasis, i.e. that there exists a table which induces the globally optimal mixture. This is interesting because not all mixtures can be induced from the table. In particular, the induced means must lie within the convex hull of the sample set, and the covariances are similarly constrained.

The optimization problem corresponding to the other, a posteriori face of the mixture seeks to maximize the a posteriori likelihood of a labeled training set. That is, to find a model $M$ which maximizes:


\begin{displaymath}
\prod_{i=1}^n p(\omega(s_i)\vert s_i,M)
\end{displaymath}

where $\omega(s_i)$ is the class label associated with sample $s_i$. This is sometimes called the maximum mutual information (MMI) criterion in the speech recognition literature [Bro87]. We remark that more general forms of this problem may be stated in which the labels are themselves probabilities. Sections 5.3 and 5.4 consider the question of whether emphasis-like reparameterizations may be used to attack the more general form of this problem.
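
For concreteness, here is a minimal sketch of this objective under assumptions of our own (integer class labels $0,\ldots,k-1$, scipy for the densities); the log is taken purely to avoid numerical underflow:

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def log_posterior_likelihood(samples, labels, means, covs, mix):
    """log of  prod_i p(omega(s_i) | s_i, M)  for a normal mixture M."""
    k = len(mix)
    # k x n table of joint values p(s_j, omega_i) = p(omega_i) p(s_j | omega_i).
    joint = np.stack([mix[i] * multivariate_normal.pdf(samples, means[i], covs[i])
                      for i in range(k)])
    post = joint / joint.sum(axis=0, keepdims=True)   # a posteriori class probabilities
    return np.log(post[labels, np.arange(len(labels))]).sum()
\end{verbatim}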

This is relevant to a common heuristic in machine learning in which one somehow increases the weight of (emphasizes) a mistake in order to improve classification accuracy. It is natural to wonder whether such approaches are even capable of expressing the solution. In the case of normal mixtures this chapter illuminates the issue considerably. By theorem 6 one can almost always exactly match the a posteriori behavior of an arbitrary mixture by an emphasis method that slightly modifies the EM approach. Section 5.4 exposes the limitations of EM-style emphasis and shows that it is not in this sense universal.

These results depend on a detailed understanding of the degenerate relationship between a normal mixture and its induced class functions. A convenient visualization of these a posteriori class functions focuses on decision boundaries, i.e. the surfaces along which classification is ambiguous.

Figure: An even mixture of two normal densities in ${\mathbb{R}}^2$, each with identity covariance. The two classes are separated by a linear decision boundary.
\scalebox{0.75}{\includegraphics{figs/decide.fig.ps}}

Imagery like ours in figure 11, and pages 28-31 of [DH73], suggests an intuitive relationship between mixture component locations and the resulting a posteriori class structure and decision surfaces. One imagines each mean to be asserting ownership over some volume of space surrounding it. This view is however far from the truth, and the a posteriori class structures arising from normal mixtures are far more interesting than these simple examples suggest.

Theorem 5 reveals that the a posteriori class behavior of normal mixtures is far stranger and more counter-intuitive than is generally understood. It shows that knowledge of a mixture's a posteriori class structure (which may be thought of as decision surfaces) tells us essentially nothing about the absolute location of the mixture's means - or the absolute nature of its covariances. One can easily imagine a normal mixture, and picture its corresponding class structure. Now if one is instead given only this class structure, the theorem says that infinitely many normal mixtures are consistent with it - and in particular that the means of such a mixture can lie within an arbitrarily small ball, located anywhere in space.

Despite the popularity of normal mixtures for classification problems, the precise nature of this highly degenerate relationship between class functions and mixtures does not seem to have been considered in the literature. Part of the reason may be that the simple view, in which means dominate some portion of the space around them, holds exactly when all covariances have identical determinant and uniform mixing coefficients are used. The most common example of this setting is nearest-neighbor Euclidean clustering, which corresponds to the identity covariance case. In practice, if the determinants are somewhat comparable, and the mixing coefficients nearly uniform, the simple view is almost always valid. But in many cases, such as when estimating mixture parameters, no such constraint is imposed, and the means, covariance matrices, and mixing coefficients are free to assume any values. Here the full range of strange behavior may be expected.


A Posteriori Degeneracy

Let $s_1,\ldots,s_n \in {\mathbb{R}}^d$ be a set of observations. Any set of positive weights $\gamma_1,\ldots,\gamma_n$ gives rise in a natural way to a normal density. That is, the normal density having mean:


\begin{displaymath}
\mu = \frac{1}{\sum_{i=1}^n \gamma_i} \sum_{i=1}^n \gamma_i s_i
\end{displaymath} (5.1)

and covariance:


\begin{displaymath}
\Sigma = \frac{1}{\sum_{i=1}^n \gamma_i}
\sum_{i=1}^n \gamma_i (s_i-\mu)(s_i-\mu)^t
\end{displaymath} (5.2)

which may be thought of as the weighted maximum likelihood (ML) model corresponding to the observations. Given $k$ sets of weights, as many normal densities arise. Adding $k$ additional positive mixing weights $w_1,\ldots,w_k$ serves to specify a mixture of these $k$ densities, with the probability of the $i$th component given by $w_i/\sum_{\ell=1}^k w_\ell$. Thus a total of $nk + k$ weights induce a normal mixture, and we loosely refer to them as emphasis parameters.
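
A minimal sketch of this map from the $nk + k$ emphasis parameters to a mixture, i.e. Eqs. 5.1 and 5.2 applied row by row (the names are ours), follows:

\begin{verbatim}
import numpy as np

def mixture_from_emphasis(gamma, w, samples):
    """gamma: (k, n) positive weights; w: (k,) positive mixing weights;
    samples: (n, d).  Returns (means, covs, mix) per Eqs. 5.1 and 5.2."""
    weights = gamma / gamma.sum(axis=1, keepdims=True)   # each row: a convex combination
    means = weights @ samples
    covs = np.stack([
        (weights[i][:, None] * (samples - means[i])).T @ (samples - means[i])
        for i in range(len(w))])
    mix = np.asarray(w, dtype=float) / np.sum(w)
    return means, covs, mix
\end{verbatim}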

The well known expectation maximization (EM) method may be viewed as a process which accepts as input a normal mixture, and outputs an emphasis parameter set - from which an improved normal mixture arises. This cycle continues as the algorithm climbs towards a local maximum. The global maximum is a stationary point of this iteration.

From our viewpoint, what is interesting here is that it is possible to search for an optimal normal mixture by searching over the space of values for the $nk + k$ emphasis parameters - rather than using the canonical parameterization consisting of the unknown mean vectors, covariance matrices, and mixing coefficients.

Now the emphasis parameterization above cannot express an arbitrary mixture. This is because each induced mean is a convex combination of the observations, and must therefore lie within their convex hull. An immediate corollary to our EM comments above is then that the means of an optimal mixture lie within the convex hull of the observation set. Similarly, not all covariance matrices can be generated by emphasis, and the covariances of an optimal mixture are similarly constrained.

Expectation maximization strives to maximize the probability of the observation set. If instead the function to be optimized depends only on the a posteriori behavior of the mixture, it is natural to ask:

When can emphasis of an observation set induce a mixture which is class equivalent to an arbitrary one?

Obviously the observation set must satisfy certain orthodoxy conditions such as being of sufficient size and possessing in some sense enough independence. But beyond these issues the fact that many mean vectors and covariance matrices are not expressible by emphasis would seem to present a much larger obstacle.

In this section we show that the relationship between a normal mixture and its class functions is highly degenerate - so much so that the constrained expressiveness of emphasis does not immediately rule out an affirmative answer to our question. Examples and discussion follow the main mathematical development. The section ends with a technical convergence-rate result that is needed to establish the next section's reparameterization theorem.

Definition 5   A finite mixture is a probability function on a measure space ${\cal X}$, arising from $k$ conditional probability functions (components) as follows:


\begin{displaymath}
p(x) = \sum_{i=1}^k p(x,\omega_i) =
\sum_{i=1}^k p(x\vert\omega_i) p(\omega_i)
\end{displaymath}

where the constants $\{p(\omega_i)\}$ are referred to as the mixing coefficients. Any such mixture induces $k$ a posteriori class functions:


\begin{displaymath}
p(\omega_i\vert x) = \frac{p(x\vert\omega_i) p(\omega_i)}{p(x)}
\end{displaymath}

Two finite mixtures are class-equivalent if they induce the same class functions.

The class functions are not independent since they are constrained by $\sum_{i=1}^k p(\omega_i\vert x) = 1$. We begin with a simple proposition which allows us, in what follows, to focus on ratios of conditional probabilities and ignore constant factors. This will be important since the nonconstant portion of the ratio of two normal densities is a rather simple object.

Proposition 4   Let $p(x)$ be a k-component mixture, and $p^\prime(x\vert\omega_1),\ldots, p^\prime(x\vert\omega_k)$ be a collection of strictly positive conditional probability functions. If for some $j$, there exist constants $C_{i,j}$ such that $\forall i \neq j$ and $x \in {\cal X}$:


\begin{displaymath}
\frac{p(x\vert\omega_i)}{p(x\vert\omega_j)} =
C_{i,j} \cdot
\frac{p^\prime(x\vert\omega_i)}{p^\prime(x\vert\omega_j)}
\end{displaymath} (5.3)

then there exist mixing coefficients $p^\prime(\omega_1),\ldots, p^\prime(\omega_k)$ such that the resulting mixture $p^\prime(x)$ is class-equivalent to $p(x)$.

proof: We begin by giving a formula for mixing coefficients $p^\prime(\omega_1),\ldots, p^\prime(\omega_k)$, such that:


\begin{displaymath}
\underbrace{C_{i,j} \frac{p(\omega_i)}{p(\omega_j)}}_{k_{i,j}} =
\frac{p^\prime(\omega_i)}{p^\prime(\omega_j)}
\end{displaymath} (5.4)

Relaxing for the moment the restriction $\sum_{\ell=1}^k p^\prime(\omega_\ell) = 1$, observe that if $p^\prime(\omega_j)$ were $1$, then setting $p^\prime(\omega_i) = k_{i,j}, i \neq j$, would satisfy the equation. Normalization then yields a solution which does sum to $1$:

\begin{eqnarray*}
p^\prime(\omega_j) & = & \frac{1}{1 + \sum_{\ell \neq j} k_{\ell,j}} \\
p^\prime(\omega_i) & = & \frac{k_{i,j}}{1 + \sum_{\ell \neq j} k_{\ell,j}}
\;\; i \neq j
\end{eqnarray*}



Multiplying each side of Eq. 5.4 by the corresponding side of Eq. 5.3 yields:


\begin{displaymath}
\frac{p(x\vert\omega_i)p(\omega_i)}{p(x\vert\omega_j)p(\omega_j)} =
\frac{p^\prime (x\vert\omega_i)p^\prime (\omega_i)}{p^\prime (x\vert\omega_j)p^\prime (\omega_j)}
\; \; \forall i \neq j, x \in {\cal X}
\end{displaymath} (5.5)

The approach above works so long as $p(\omega_j) \neq 0$. If it is zero, then $p$ may be treated as a $k-1$ component mixture, and the proposition follows by induction. Until now $j$ has been fixed and Eq. 5.5 is established for $i \neq j$. This equation is however easily extended to any pair $i,\ell$ of indices. Let $f_i$ denote $p(x\vert\omega_i)p(\omega_i)$ and $f^\prime _i$ denote $p^\prime (x\vert\omega_i)p^\prime (\omega_i)$. Then:


\begin{displaymath}
\frac{f_i}{f_\ell} = \frac{f_i}{f_j} \cdot \frac{f_j}{f_\ell}
= \frac{f^\prime _i}{f^\prime _j} \cdot \frac{f^\prime _j}{f^\prime _\ell}
= \frac{f^\prime _i}{f^\prime _\ell}
\end{displaymath}

Now we may write:


\begin{displaymath}
p(\omega_i\vert x) = \frac{f_i}{\sum_{\ell=1}^k f_\ell} =
\frac{1}{1 + \sum_{\ell \neq i} f_\ell/f_i}
\end{displaymath}

and a corresponding expression for $p^\prime (\omega_i\vert x)$ in terms of $\{f^\prime _\ell\}$. We have already seen that all ratios $f_i/f_\ell = f^\prime _i/f^\prime _\ell$, so $\sum_{\ell \neq i} f_\ell/f_i = \sum_{\ell \neq i} f^\prime _\ell/f^\prime _i$, whence $p(\omega_i\vert x) = p^\prime (\omega_i\vert x)$ and we are done.$\;\Box$

We now specialize our discussion to mixtures whose components are normal densities.

Definition 6   A $d$-dimensional $k$-component normal mixture is a finite mixture of multivariate normal densities:


\begin{displaymath}
N_{\mu,\Sigma}(x) \triangleq \frac{1}{(2\pi)^{d/2}
\vert\Sigma\vert^{1/2}} \cdot
e^{-\frac{1}{2} (x-\mu)^t \Sigma^{-1} (x-\mu)}
\end{displaymath} (5.6)

The mixture's parameters are then $\Phi
= \{\Sigma_1, \ldots, \Sigma_k, \mu_1, \ldots, \mu_k,
p(\omega_1), \ldots, p(\omega_k)\}$.

The following theorem exposes the large degeneracy in the relationship between normal mixtures, and their induced class functions.

Theorem 5   Let $p$ be a $d$-dimensional normal mixture with $k$ components. For any $x \in {\mathbb{R}}^d$ and $\epsilon > 0$, there exists a $d$-dimensional $k$-component normal mixture $p^\prime $, with means ${\mu^\prime}_i$ and covariances ${\Sigma^\prime}_i$, such that:

  1. $\Vert{\mu^\prime}_i - x\Vert < \epsilon$ for every $i$
  2. $\Vert{\Sigma^\prime}_i\Vert _2 < \epsilon$ for every $i$
  3. $p^\prime $ and $p$ are class-equivalent

where $\Vert\cdot\Vert _2$ refers to the Frobenius/Euclidean matrix norm.

proof: Begin by selecting some distinguished component $j$. By proposition 4 we have only to produce ${\Sigma^\prime}_1,\ldots,{\Sigma^\prime}_k$ and ${\mu^\prime}_1,\ldots,{\mu^\prime}_k$ such that the ratio conditions of the proposition are satisfied. Since constant factors do not matter, we ignore several portions of the definition of a normal density, Eq. 5.6 - namely the leading constant and the constant portion of the exponent once it is multiplied out. The result is that:


\begin{displaymath}
N_{\mu,\Sigma}(x) \propto e^{-\frac{1}{2}(x^t\Sigma^{-1}x -
2\mu^t \Sigma^{-1} x)}
\end{displaymath}

The proportional ratio condition of proposition  4 then leads immediately to the following necessary and sufficient conditions:


\begin{displaymath}
\left.
\begin{array}{rcl}
\Sigma_i^{-1} - \Sigma_j^{-1} & = & {\Sigma^\prime}_i^{-1} - {\Sigma^\prime}_j^{-1} \\
\Sigma_i^{-1}\mu_i - \Sigma_j^{-1}\mu_j & = & {\Sigma^\prime}_i^{-1}{\mu^\prime}_i - {\Sigma^\prime}_j^{-1}{\mu^\prime}_j
\end{array} \right\} \;\; \forall i \neq j
\end{displaymath} (5.7)

We set ${\mu^\prime}_j = x$ and begin by sketching the rest of the proof. The constraints are satisfied by choosing ${\Sigma^\prime}_j^{-1}$ so that each of its eigenvalues is large. Each resulting ${\Sigma^\prime}_i^{-1}$ must then be positive definite and will also have large eigenvalues. Both ${\Sigma^\prime}_j$ and ${\Sigma^\prime}_i$ then have small norm, satisfying the requirements of the theorem. In this process, it is only ${\Sigma^\prime}_j^{-1}$ that we are free to choose. Each choice determines the ${\Sigma^\prime}_i$ and ${\mu^\prime}_i$ (since we have already set ${\mu^\prime}_j = x$). In the limit, as we choose ${\Sigma^\prime}_j^{-1}$ with increasingly large eigenvalues, $\Vert{\mu^\prime}_i - {\mu^\prime}_j\Vert = \Vert{\mu^\prime}_i - x\Vert$ approaches zero, whence the theorem's other condition is satisfied as well.

Denote by $\overline{\lambda}(A)$ the largest eigenvalue of a positive definite matrix $A$, and by $\underline{\lambda}(A)$ the smallest. For symmetric matrices $A,B$ it then follows easily from the triangle inequality that $\underline{\lambda}(A+B) \geq \underline{\lambda}(A) - \overline{\lambda}(B)$, by writing $\Vert[(A+B)+(-B)]u\Vert \leq \Vert(A+B)u\Vert + \Vert{-B}u\Vert$ where $u$ denotes any unit length vector. Now the first constraint in Eq. 5.7 may be written:


\begin{displaymath}
{\Sigma^\prime}_i^{-1} = {\Sigma^\prime}_j^{-1} + (\Sigma_i^{-1} - \Sigma_j^{-1})
\end{displaymath} (5.8)

and we then have:


\begin{displaymath}
\underline{\lambda}({\Sigma^\prime}_i^{-1}) \geq
\underline{\lambda}({\Sigma^\prime}_j^{-1})
- \overline{\lambda}(\Sigma_i^{-1} - \Sigma_j^{-1})
\end{displaymath}

The parenthesized term is constant since we are given both $\Sigma_i^{-1}$ and $\Sigma_j^{-1}$, and by subadditivity of symmetric matrix spectral radii its largest eigenvalue does not exceed $\overline{\lambda}(\Sigma_i^{-1}) + \overline{\lambda}(\Sigma_j^{-1})$. We can easily choose ${\Sigma^\prime}_j^{-1}$ such that $\underline{\lambda}({\Sigma^\prime}_j^{-1})$ is arbitrarily large, e.g. $c\cdot I$ where $c$ is a large positive constant. It follows then that $\underline{\lambda}({\Sigma^\prime}_i^{-1})$ will also be arbitrarily large, ensuring that ${\Sigma^\prime}_i^{-1}$ is positive definite.

Next recall that for a symmetric positive definite matrix $A$ the column norms cannot exceed the spectral radius (operate on each member of the canonical basis); each is then bounded above by $\overline{\lambda}(A)$, so that $\Vert A\Vert _2 \leq \sqrt{d}\, \overline{\lambda}(A)$. Choosing the smallest eigenvalues of ${\Sigma^\prime}_j^{-1}$ and ${\Sigma^\prime}_i^{-1}$ to be arbitrarily large forces $\overline{\lambda}({\Sigma^\prime}_j)$ and $\overline{\lambda}({\Sigma^\prime}_i)$ to be arbitrarily small - satisfying the theorem's second condition.

Recall that ${\mu^\prime}_j$ was set equal to the $x$ from the statement of the theorem; the second constraint in Eq. 5.7 may then be rearranged to yield:


\begin{displaymath}
{\mu^\prime}_i = ({\Sigma^\prime}_i {\Sigma^\prime}_j^{-1})\, {\mu^\prime}_j
+ [\, {\Sigma^\prime}_i (\Sigma_i^{-1} \mu_i - \Sigma_j^{-1} \mu_j) \,]
\end{displaymath} (5.9)

By our earlier discussion, we may force ${\Sigma^\prime}_i$ to be arbitrarily close to zero - forcing the bracketed term in Eq. 5.9 to approach zero as well. We next show that the first parenthesized term ${\Sigma^\prime}_i {\Sigma^\prime}_j^{-1}$ tends to the identity matrix $I$ as $\underline{\lambda}({\Sigma^\prime}_j^{-1}) \rightarrow \infty$. This then demonstrates that ${\mu^\prime}_i \rightarrow x$, satisfying the theorem's first condition, and finishes the proof.

It is easy to see that ${\Sigma^\prime}_i {\Sigma^\prime}_j^{-1} \rightarrow I$ if and only if $({\Sigma^\prime}_i {\Sigma^\prime}_j^{-1})^{-1} = {\Sigma^\prime}_j {\Sigma^\prime}_i^{-1} \rightarrow I$. Using Eq. 5.8 this becomes:


\begin{displaymath}
I + {\Sigma^\prime}_j (\Sigma_i^{-1} - \Sigma_j^{-1}) \rightarrow I
\end{displaymath}

which is clearly the case since ${\Sigma^\prime}_j \rightarrow 0$.$\;\Box$

To recapitulate, any location for ${\mu^\prime}_j$, together with a sufficiently large (in the $\underline{\lambda}$ sense) ${\Sigma^\prime}_j^{-1}$, gives rise via the proof's constructions to values for the other means, covariances, and mixing coefficients such that a class-equivalent mixture results. The proof goes on to establish that in the limit, everything is confined as required within an $\epsilon$-neighborhood.
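
The construction is easy to exercise numerically. The sketch below is our own illustration (the target mixture, the point $x$, and all names are arbitrary choices): it builds a class-equivalent two-component mixture in ${\mathbb{R}}^2$ whose means sit near $x$, obtains the mixing weights via proposition 4, and checks that the induced posteriors agree; posteriors are evaluated in log space to sidestep the floating point underflow such displaced mixtures invite.

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal as mvn
from scipy.special import softmax

# An arbitrary target mixture in R^2; component 0 plays the role of the
# distinguished component j.
mus  = [np.array([0.0, 0.0]), np.array([3.0, -1.0])]
Sigs = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
mix  = np.array([0.4, 0.6])
Sinv = [np.linalg.inv(S) for S in Sigs]

# Theorem 5's construction: place the new means near an arbitrary x by choosing
# a sharply peaked distinguished precision Sigma'_0^{-1} = c I.
x, c = np.array([5.0, 5.0]), 100.0
Sp_inv = [c * np.eye(2) + (Sinv[i] - Sinv[0]) for i in range(2)]   # first line of Eq. 5.7
Sp = [np.linalg.inv(P) for P in Sp_inv]
mup = [x, Sp[1] @ (Sp_inv[0] @ x + Sinv[1] @ mus[1] - Sinv[0] @ mus[0])]  # Eq. 5.9

# Mixing weights via proposition 4: the constant C_{1,0}, then the normalization of Eq. 5.4.
t = np.zeros(2)        # any evaluation point; the ratio of ratios is constant in x
logC = (mvn.logpdf(t, mus[1], Sigs[1]) - mvn.logpdf(t, mus[0], Sigs[0])) \
     - (mvn.logpdf(t, mup[1], Sp[1]) - mvn.logpdf(t, mup[0], Sp[0]))
k10 = np.exp(logC) * mix[1] / mix[0]
mixp = np.array([1.0, k10]) / (1.0 + k10)

def posteriors(pt, means, covs, weights):
    """A posteriori class probabilities, in log space to avoid underflow."""
    lp = [np.log(weights[i]) + mvn.logpdf(pt, means[i], covs[i]) for i in range(2)]
    return softmax(np.array(lp))

# Class equivalence: the posteriors of the two mixtures agree everywhere.
for pt in np.random.default_rng(0).normal(scale=3.0, size=(5, 2)):
    assert np.allclose(posteriors(pt, mus, Sigs, mix),
                       posteriors(pt, mup, Sp, mixp), atol=1e-6)
\end{verbatim}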

Figure 12: Two ways to induce a single class function. An illustration of the degeneracy in the relationship between normal mixtures, and their induced class functions. Zero mean normal densities G1 and G2 when mixed evenly induce C1, the a posteriori probability of G1. Normal densities G1' and G2' located in the right portion of the graph do not have zero means, but may be mixed (in a far from even way) so that C1 results.
\scalebox{0.75}{\includegraphics{figs/cannon.ps}}

Figure 12 provides a simple one dimensional example of the normal mixture to class function degeneracy. The left pair of Gaussians G1,G2 both have zero mean. The taller one, G1, corresponds to the first class and has variance $1/2$, while G2 has variance 1 and corresponds to the second class. When mixed evenly, the a posteriori probability C1 results. Notice that C1 assumes value $1/2$ where G1 crosses G2. The right pair of Gaussians G1',G2' have means at $4$ and $5$, and variances of $1/5$ and $1/4$ respectively. An even mixture of them would result in a class function very different from C1. But if very little weight ( $\approx 5.74234\times 10^{-5}$) is placed on G1', its mode drops under the graph of G2', and surprisingly, the induced class function is exactly C1. The parameters of this new mixture follow immediately from the calculations within the proof of theorem 5.
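
The numbers above can be checked directly from the class-equivalence conditions and proposition 4; the following sketch (variable names are ours) recovers the tiny mixing weight and confirms that the two induced class functions coincide:

\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Left pair: zero-mean densities with variances 1/2 and 1, mixed evenly.
# Right pair: means 4 and 5 with variances 1/5 and 1/4.
m, v = [0.0, 0.0, 4.0, 5.0], [0.5, 1.0, 0.2, 0.25]

# The conditions of Eq. 5.7 hold: the differences of inverse variances agree,
# and so do the differences of mean/variance ratios.
assert np.isclose(1/v[2] - 1/v[3], 1/v[0] - 1/v[1])
assert np.isclose(m[2]/v[2] - m[3]/v[3], m[0]/v[0] - m[1]/v[1])

# Proposition 4 then gives the weight on G1' needed to reproduce C1 exactly.
C = (norm.pdf(0, m[0], np.sqrt(v[0])) / norm.pdf(0, m[1], np.sqrt(v[1]))) \
  / (norm.pdf(0, m[2], np.sqrt(v[2])) / norm.pdf(0, m[3], np.sqrt(v[3])))
w1 = C / (1 + C)               # original mixing is even, so k_{1,2} = C
print(w1)                      # approximately 5.7423e-05

# The induced class functions coincide on a grid of points.
xs = np.linspace(-3, 8, 200)
g1, g2 = norm.pdf(xs, m[0], np.sqrt(v[0])), norm.pdf(xs, m[1], np.sqrt(v[1]))
h1, h2 = norm.pdf(xs, m[2], np.sqrt(v[2])), norm.pdf(xs, m[3], np.sqrt(v[3]))
assert np.allclose(0.5*g1/(0.5*g1 + 0.5*g2), w1*h1/(w1*h1 + (1-w1)*h2))
\end{verbatim}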

Figure 13 depicts another two class problem, this time in two dimensions. Here various elements of each class are arranged so that they may be perfectly separated by a circular decision boundary. It is clear that we can create such a boundary by modeling each class with a normal density centered at the origin. We may evenly mix these components and set their covariance matrices to be multiples of the identity matrix, chosen so that the two densities' intersection is the desired circle. This is in some sense the canonical solution, but infinitely many others are possible. To begin with, it is not necessary that both covariance matrices be multiples of the identity; it suffices that their difference be such a multiple. Nor is it necessary that the means be located at the origin. Given a choice of covariance matrices, and the location of one mean, a location for the other may be computed which gives rise to the same boundary. Moreover, by choosing the covariances to be nearly zero, corresponding to highly peaked densities, the two means may be located within an arbitrary $\epsilon$-neighborhood anywhere in ${\mathbb{R}}^2$. This illustrates the degeneracy by focusing only on the decision boundary, but theorem 5 shows that the family of class functions may be matched everywhere. The situation in figure 13 was first contrived in an attempt to disprove our emphasis reparameterization theorem, the subject to which we next turn. The data points are arranged along only part of the circle so that no convex combination of them can possibly express the origin. That is, the origin is outside of the convex hull of the dataset. At the time we thought that this would prevent such combinations from leading to a mixture capable of inducing the desired circular decision boundary. The revelation of theorem 5 is that the component means of such a mixture can be located anywhere.

Figure 13: If data points of two colors are imagined to hug the inside and outside of a circular decision boundary, one might think that a Gaussian mixture model capable of separating the classes would necessarily have component means at the center - but this is not the case.
\scalebox{0.75}{\includegraphics{figs/clock.fig.ps}}

We have seen that all the ${\Sigma^\prime}_i$ converge to ${\Sigma^\prime}_j$ as the latter tends to zero. At the same time, the ${\mu^\prime}_i$ approach ${\mu^\prime}_j$. The rate of convergence is our next topic. In particular we will show that $\Vert{\Sigma^\prime}_i - {\Sigma^\prime}_j\Vert \rightarrow 0$ quadratically as ${\Sigma^\prime}_j \rightarrow 0$. This is formalized in the following proposition, which is needed to establish the main result of the next section. We remark that the convergence of each ${\mu^\prime}_i$ to ${\mu^\prime}_j$ is only linear.

Proposition 5   $\Vert{\Sigma^\prime}_i - {\Sigma^\prime}_j\Vert = O(\overline{\lambda}^2({\Sigma^\prime}_j))$

proof: The difference $\Sigma_i^{-1} - \Sigma_j^{-1}$ is constant and is denoted by $C$. Then:


\begin{displaymath}
{\Sigma^\prime}_i-{\Sigma^\prime}_j = ({\Sigma^\prime}_j^{-1} + C)^{-1}
- ({\Sigma^\prime}_j^{-1})^{-1}
\end{displaymath}

Denoting ${\Sigma^\prime}_j^{-1}$ as $P$ for clarity this becomes:


\begin{eqnarray*}
(P+C)^{-1}- P^{-1} & = & [(I + CP^{-1}) P]^{-1}- P^{-1} \\
 & = & -P^{-1}[I - (I + CP^{-1})^{-1}]
\end{eqnarray*} (5.10)

Now clearly $I+CP^{-1}\rightarrow I$ at rate $\overline{\lambda}(P^{-1})$, and the eigenvalues of $I + CP^{-1}$ approach unity at this rate. So $(I+CP^{-1})^{-1}\rightarrow I$ as well - also at rate $\overline{\lambda}(P^{-1})$. Hence $\Vert I - (I + CP^{-1})^{-1}\Vert = O(\overline{\lambda}(P^{-1}))$ in Eq. 5.10. But $\Vert P^{-1}\Vert = O(\overline{\lambda}(P^{-1}))$ as well, so $\Vert(P+C)^{-1}- P^{-1}\Vert = O(\overline{\lambda}^2(P^{-1}))$.$\;\Box$


Emphasis Reparameterization

The question ``Can emphasis induce a mixture class-equivalent to an arbitrary one?'' posed at the beginning of the previous section is reasonable since from theorem 5 we know that the means of such a mixture can lie within an arbitrarily small ball located anywhere in space. So confinement to the convex hull of the samples is not necessarily a restriction. Likewise the covariances can be chosen to have arbitrarily small Frobenius norm.

The main result of this section is a qualified positive answer to this question. It states that a modified form of emphasis can, for almost all sufficiently large observation sets, induce a mixture which is class equivalent to an arbitrary one.

Modifications are needed because even though the means and covariances may be chosen with a great deal of freedom, they are still highly constrained by the conditions of Eq. 5.7 - and it is sometimes impossible to generate a mixture satisfying these constraints using the form of emphasis employed by EM. The next section expands on the limits of EM-style emphasis.

The required modifications consist of a single change to Eq. 5.2, which becomes:


\begin{displaymath}
\Sigma = \sum_{i=1}^n \gamma_i (s_i-\mu)(s_i-\mu)^t
\end{displaymath} (5.11)

where the leading normalization has been removed. This allows the scale of the covariance to be adjusted without affecting the mean.

Before stating and proving our reparameterization result, a particular matrix is defined whose nonsingularity is a condition of the theorem.

Condition 1   Denote by $S_i$ the column vector formed by concatenating the distinct (upper-triangular) entries of $s_i s_i^t$ and $s_i$, and let $n$ (the number of samples) equal $d(d+3)/2$. Next form the square matrix $S$ by combining these columns. For example, if $d=2$ we have:


\begin{displaymath}
S \triangleq \left(
\begin{array}{ccccc}
s_{1,1}^2 & s_{2,1}^2 & s_{3,1}^2 & s_{4,1}^2 & s_{5,1}^2 \\
s_{1,1} s_{1,2} & s_{2,1} s_{2,2} & s_{3,1} s_{3,2} & s_{4,1} s_{4,2} & s_{5,1} s_{5,2} \\
s_{1,2}^2 & s_{2,2}^2 & s_{3,2}^2 & s_{4,2}^2 & s_{5,2}^2 \\
s_{1,1} & s_{2,1} & s_{3,1} & s_{4,1} & s_{5,1} \\
s_{1,2} & s_{2,2} & s_{3,2} & s_{4,2} & s_{5,2}
\end{array} \right)
\end{displaymath}

Our condition is that $S$ be nonsingular.

It is not hard to show that this condition is almost always satisfied, a fact formalized by:

Proposition 6   Regarding a sample of size $d(d+3)/2$ as a single random variable of dimension $d^2(d+3)/2$, the set of such vectors which give rise to singular $S$ matrices has measure zero.

proof: Since the matrix's entries are monomials of distinct structure, no non-zero linear combination (over the ring of polynomials) of its columns or rows vanishes identically. So its determinant, viewed as a polynomial in the sample coordinates, is not identically zero. But the roots of a non-zero polynomial form a set of measure zero. $\;\Box$

Next we simplify our setting, adjust notation, and establish one more proposition before turning to the theorem. Without loss of generality, we may assume that our samples have mean zero. To see this, first let $t$ denote some target mean, where we seek stochastic $\gamma$ values such that $\sum \gamma_i s_i = t$. We may equivalently solve $\sum \gamma_i s_i^\prime = t^\prime$, where $s_i^\prime \triangleq s_i - E[\{s_i\}]$ and $t^\prime = t-E[\{s_i\}]$, since $\sum \gamma_i s_i^\prime = (\sum \gamma_i s_i) - E[\{s_i\}]$. The problem of expressing a target covariance is unchanged by coordinate translation, since $(s_i - t) = (s_i^\prime - t^\prime)$. Notice that this is true even if the $\gamma$ are not stochastic. So in what follows, we will drop the prime notation and simply assume that $E[\{s_i\}] = 0$.

The mixture we generate will have all of its means located near zero, the mean of the sample set. The distinguished mixture component, identified with index $j$ in the previous section, is located exactly at the origin. The locations of the other mean vectors are thought of as small displacements $\delta$ from the origin. We then write $s_i(\delta)$ to mean $s_i - \delta$, and the $S$ matrix is then parameterized by $\delta$ in the obvious way, and denoted $S(\delta)$. Note that $S(0)$ is just $S$ as originally defined.

Proposition 7   Given nonsingular $S$, $\exists c > 0$ such that $\forall \delta$ satisfying $\Vert\delta\Vert < c$, $S(\delta)$ is nonsingular.

proof: Immediate from the continuity of the $\delta$ parameterization of $S$. $\;\Box$

Each value of $\delta$ corresponds to a choice of mean vector, and gives rise to a linear mapping $S(\delta)$. That is, for each choice of $\delta$, a linear map arises. The proposition tells us that so long as our mixture's means are kept within distance $c$ of the origin, all associated linear mappings are invertible.

The utility of $S(\delta)$ is that it allows us to simultaneously express the objectives of obtaining by emphasis a specified mean vector and covariance matrix. If $\Gamma$ is a nonnegative emphasis vector, $\delta_t$ is the targeted mean vector, and $\Sigma_t$ the desired covariance matrix, then we must solve the system $S(\delta_t) \Gamma = \Sigma_t : 0$. Here $\Sigma_t$ is regarded as a vector which is concatenated (denoted ``:'') with the $d$-dimensional zero vector.

The $\Gamma$ above is nonnegative, but not necessarily stochastic. This may seem to be inconsistent with our definition of the generation of a mean vector by emphasis, because of the normalization that definition includes. But it is not, since $\sum \gamma_i (s_i - \delta) = 0$ is equivalent to $(1/\sum \gamma_i) \sum \gamma_i s_i = \delta$, which is exactly our definition.
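
As a concrete reading of this system (a sketch of our own; the use of the upper-triangular entries of $(s_i-\delta)(s_i-\delta)^t$ as the covariance rows, and the generic linear solver, are our choices), for $d = 2$ and $n = d(d+3)/2 = 5$:

\begin{verbatim}
import numpy as np

def S_matrix(samples, delta):
    """Column i: upper triangle of (s_i-delta)(s_i-delta)^t followed by s_i-delta."""
    d = samples.shape[1]
    iu = np.triu_indices(d)
    cols = []
    for s in samples:
        v = s - delta
        cols.append(np.concatenate([np.outer(v, v)[iu], v]))
    return np.column_stack(cols)          # shape (d(d+3)/2, n)

def emphasis_for(samples, delta, target_Sigma):
    """Solve S(delta) Gamma = target_Sigma : 0."""
    d = samples.shape[1]
    iu = np.triu_indices(d)
    rhs = np.concatenate([target_Sigma[iu], np.zeros(d)])
    return np.linalg.solve(S_matrix(samples, delta), rhs)

rng = np.random.default_rng(1)
samples = rng.normal(size=(5, 2))
samples -= samples.mean(axis=0)           # translate so the sample mean is zero
delta, target = np.array([0.1, -0.05]), 0.05 * np.eye(2)
gamma = emphasis_for(samples, delta, target)

# The solution reproduces the target mean and the (unnormalized, Eq. 5.11) covariance.
centered = samples - delta
assert np.allclose(gamma @ centered, 0.0)      # i.e. the emphasis mean equals delta
assert np.allclose((gamma[:, None] * centered).T @ centered, target)
# Nonnegativity of gamma is not guaranteed for an arbitrary target; the theorem below
# shows it can be achieved for targets arising from theorem 5's construction.
\end{verbatim}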

Theorem 6   Given any $d$-dimensional normal mixture $M$ with $k$ components, and $s_1,\ldots,s_n \in {\mathbb{R}}^d$ such that $n \geq d(d+3)/2$ with some subset of size $d(d+3)/2$ satisfying condition 1, there exists a $k \times n$ table of nonnegative values $\{\gamma_{i,j}\}$, and nonnegative values $m_1,\ldots,m_{k-1}$ with $\sum_{j=1}^{k-1} m_j \leq 1$, such that the normal mixture $M^\prime$ generated as described below is class-equivalent to $M$.

  1. The mixing parameters of $M^\prime$ are $m_1,\ldots,m_{k-1},\;1-\sum_{j=1}^{k-1} m_j$.

  2. Each mean $\mu_i^\prime$ within $M^\prime$ is given by:


    \begin{displaymath}
\mu_i^\prime = \frac{1}{\sum_{j=1}^n \gamma_{i,j}}
\sum_{j=1}^{n} \gamma_{i,j} s_j
\end{displaymath}

  3. Each covariance $\Sigma_i^\prime$ within $M^\prime$ is given by:


    \begin{displaymath}
\Sigma_i^\prime = \sum_{j=1}^{n} \gamma_{i,j} (s_j-\mu_i^\prime)(s_j-\mu_i^\prime)^t
\end{displaymath}

Also, the number of parameters may be reduced to $(k-1)d(d+3)/2+d$, matching the complexity of the canonical parameterization, by replacing the top $\gamma$-table row by a single new parameter $\alpha$.

proof: Choose some subset of the $\{s_i\}$ of size exactly $d(d+3)/2$ that satisfies condition 1. We disregard the other elements entirely, and therefore simply assume that $n = d(d+3)/2$. As argued earlier, we may also assume without loss of generality that the $\{s_i\}$ have mean zero. Then the distinguished mixture component from theorem 5, with parameters $({\mu^\prime}_j, {\Sigma^\prime}_j)$, is chosen as follows. Its mean ${\mu^\prime}_j$ is the origin, and its covariance ${\Sigma^\prime}_j$ is proportional to the average of the element self-outer-products, i.e.:


\begin{displaymath}
{\Sigma^\prime}_j = \alpha \frac{1}{n} \sum_{i=1}^n s_i s_i^t
\end{displaymath}

This may be thought of as choosing the first row of the $\gamma$ table to be uniformly $1/n$, and introducing a new scale parameter $\alpha$. The table's top row elements are no longer parameters, so the total becomes $(k-1)d(d+3)/2$ table parameters, plus $d-1$ mixing parameters, plus $\alpha$ - matching the count in the statement of the theorem. As described in the proof of theorem 5, the choice of ${\mu^\prime}_j$ is arbitrary, but ${\Sigma^\prime}_j$ may need to be scaled down to preserve the positive definite nature of all the matrices involved. The number of parameters in a direct parameterization (i.e. consisting of means, covariances, and mixing parameters) then matches our count.

As $\alpha \rightarrow 0$ we know that the ${\mu^\prime}_i \rightarrow {\mu^\prime}_j = 0$. Our first step is to choose $\alpha$ sufficiently small, so that the largest $\Vert{\mu^\prime}_i\Vert$ is smaller than the $c$ of proposition 7. Next, $\alpha$ is possibly reduced further until each covariance matrix ${\Sigma^\prime}_i$ arising from Eq. 5.8 is positive definite.

Each ${\mu^\prime}_i$ corresponds to a $\delta$ value in our definition of $S(\delta)$. The reference mean, located at the origin, corresponds to $\delta=0$, and by our construction arises from weighting the $s_\ell$ by non-negative $\gamma$ values - in this case uniform ones. Associated with each ${\mu^\prime}_i$ there is also a ${\Sigma^\prime}_i$. Recall from our earlier discussion that our focus is on the equation $S({\mu^\prime}_i) \Gamma = {\Sigma^\prime}_i : 0$. Since $S(\delta)$ is non-singular for each of the ${\mu^\prime}_i$, we know that some vector of $\gamma$ values satisfies this equation. We have only to show that solutions consisting only of nonnegative values can be found.

We will say that a point is representable under some $S(\delta)$ if its inverse image under this mapping consists only of non-negative values. There exists a representable open neighborhood about $\alpha ({\Sigma^\prime}_j : 0)$ under $S(0)$, since the inverse image of this point is the constant vector $\alpha/n$, which consists of strictly positive values.

Within our bounds on $\delta$, it parameterizes $S(\delta)$ uniformly continuously. Hence there exist values $c^\prime$ and $\epsilon$ such that for all $\delta$ with $\Vert\delta\Vert \leq c^\prime$, the open ball about $\alpha ({\Sigma^\prime}_j : 0)$ with radius $\epsilon$ is representable under $S(\delta)$. To avoid introducing another symbol, let $\epsilon$ now denote the maximum such radius.

Next notice that the space of representable points is convex and includes the origin. So representability of $\alpha ({\Sigma^\prime}_j : 0)$ implies representability for all smaller values of $\alpha$. Now if necessary, reduce $\alpha$ further so that all the ${\mu^\prime}_i$ are within $c^\prime$ of the origin.

Now let $P$ denote the set of nonnegative $\Gamma$ vectors. Let:


\begin{displaymath}
T \triangleq \bigcap_{\Vert\delta\Vert \leq c^\prime} S(\delta) P
\end{displaymath}

The set $T$ is the portion of the range representable under every $S(\delta)$ with $\Vert\delta\Vert \leq c^\prime$.

About $\alpha ({\Sigma^\prime}_j : 0)$ we have seen that there is a representable ball of radius $\epsilon$, which touches the boundary of $T$. We now denote this radius by $\epsilon(\alpha)$ since we are about to consider its behavior as $\alpha \rightarrow 0$. By our earlier comments, $T$ includes $\alpha ({\Sigma^\prime}_j : 0)$ and all proportional vectors, in particular as $\alpha \rightarrow 0$.

The geometric observation key to our proof is that $\epsilon(\alpha)$ can shrink only as fast as $\alpha$ itself, since this value represents the shortest distance to some point set, in this case $T^c$. We imagine $T$ to be a tunnel leading to the origin while constricting, with the boundary of $T^c$ forming the tunnel's wall.

Focus now on some ${\mu^\prime}_i$. While there is a corresponding representable ball about $\alpha ({\Sigma^\prime}_j : 0)$ of radius $\epsilon(\alpha)$, there is no guarantee that ${\Sigma^\prime}_i : 0$ is within it. As $\alpha \rightarrow 0$, we have seen that the radius of this ball shrinks only linearly with $\alpha$. But the distance of ${\Sigma^\prime}_i$ from ${\Sigma^\prime}_j$, by proposition 5, shrinks quadratically with $\alpha$, whence eventually, i.e. for small enough $\alpha$, ${\Sigma^\prime}_i : 0$ becomes representable. Then $\alpha$ is reduced as necessary for each component of the mixture, so that afterwards every ${\Sigma^\prime}_i : 0$ is representable.$\;\Box$

Other emphasis theorems are possible. In particular it is much easier to establish a similar result in which the weights may be either positive or negative because then the only requirement is that $S(\delta)$ be nonsingular.


The Limitations of EM-style Emphasis

In the previous section we saw that, using a particular form of emphasis reparameterization, the a posteriori behavior of an arbitrary mixture can almost always be matched. We say that such a sample set and emphasis method is universal. This section is by contrast essentially negative, and begins with an example demonstrating that even in one dimension, EM-style emphasis is not universal.

Our example is in ${\mathbb{R}}^1$, and includes two observation points located at $0$ and $1$. The leading normalization factors in Eqs. 5.1 and 5.2 amount to enforcing a convex combination, so there is only one degree of freedom given our two element observation set. If $\gamma$ denotes the weight placed on the point at $1$, then the point at $0$ receives weight $1-\gamma$. The induced mean is then just $\gamma$ and the induced variance $\gamma(1-\gamma)$. Our objective is to generate a two component mixture which is class equivalent to an arbitrary one, so two emphasis parameters $\gamma_1$ and $\gamma_2$ are needed. The two constant differences to be matched (from Eq. 5.7) are denoted $\Delta_\Sigma$ and $\Delta_\mu$. That these are independent and unconstrained characteristics of an equivalence class follows from the observation that we may, without loss of generality, set one of the target means to zero. The constraints then become:

\begin{eqnarray*}
\frac{1}{\gamma_1(1-\gamma_1)} - \frac{1}{\gamma_2(1-\gamma_2)}
& = & \Delta_\Sigma \\
\frac{1}{1-\gamma_1} - \frac{1}{1-\gamma_2}
& = & \Delta_\mu
\end{eqnarray*}



Choose $\Delta_\mu > 0$. From our second constraint we have $\gamma_1 = 1 - 1/[\Delta_\mu + 1/(1-\gamma_2)]$, from which it follows that $\gamma_1 = \gamma_2$ only when they both are $1$. But in this case $\Delta_\mu = 0$, contradicting our choice. So $\gamma_1 \neq \gamma_2$. Now when $\gamma_2 = 0$ we have $\gamma_1 > 0$, so here $\gamma_2 < \gamma_1$. Because they are never equal, and their relationship is continuous, this inequality must hold for any pair satisfying the constraints. But it is easily verified that $\Delta_\Sigma - \Delta_\mu = 1/\gamma_1 - 1/\gamma_2$, whence this quantity is negative. So no pair $\gamma_1,\gamma_2$ exists which satisfies the constraints given, say, $\Delta_\mu = 1$ and $\Delta_\Sigma = 2$.
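
The obstruction can also be seen by brute force; this sketch (ours) scans a grid of feasible $(\gamma_1,\gamma_2)$ pairs and confirms that $\Delta_\Sigma - \Delta_\mu$ is negative whenever $\Delta_\mu > 0$, so that e.g. $(\Delta_\mu,\Delta_\Sigma) = (1,2)$ is unreachable:

\begin{verbatim}
import numpy as np

# Samples at 0 and 1.  With weight g on the point at 1, EM-style emphasis induces
# mean g and variance g*(1-g); the matching conditions of the text are
#   1/(g1*(1-g1)) - 1/(g2*(1-g2)) = Delta_Sigma
#   1/(1-g1)      - 1/(1-g2)      = Delta_mu
g = np.linspace(1e-3, 1 - 1e-3, 500)
g1, g2 = np.meshgrid(g, g)
d_sigma = 1/(g1*(1-g1)) - 1/(g2*(1-g2))
d_mu    = 1/(1-g1) - 1/(1-g2)

# Wherever Delta_mu > 0 we find Delta_Sigma - Delta_mu = 1/g1 - 1/g2 < 0.
mask = d_mu > 0
assert np.all((d_sigma - d_mu)[mask] < 0)
\end{verbatim}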

We remark that if instead one is interested in matching the a posteriori class behavior of a mixture at only the given sample points, then the example's argument does not apply. Indeed it may be verified that this less stringent requirement can be satisfied. This is not entirely surprising since, given exactly $d(d+3)/2$ sample points, one has three free parameters and must match only two. This is interesting, and we conjecture that it is true more generally for $d > 1$ and $k > 2$ - so long as $n = d(d+3)/2$. In any event, as a result of our arguments at the end of this section, it cannot be true for arbitrarily large $n$. Since we are interested in reparameterizing problems with arbitrarily many samples, we will not consider the matter further in this chapter.

The example above applies to dimension $1$ only, and to a particular set of sample points. Moreover, its analytical approach does not generalize easily. We now show that in the general case EM-style emphasis is not universal.

Without loss of generality we will assume ${\mu^\prime}_j=0$ - since otherwise the problem, and any solution thereto, can be translated appropriately. We may also assume the sample points are distinct. Eq. 5.7 becomes:


\begin{displaymath}
\Delta_{\mu_i} = {\Sigma^\prime}_i^{-1}{\mu^\prime}_i
\end{displaymath} (5.12)

where $\Delta_{\mu_i} \triangleq \Sigma_i^{-1}\mu_i - \Sigma_j^{-1}\mu_j$. Our focus is on the simple rearrangement of Eq. 5.12:


\begin{displaymath}
{\Sigma^\prime}_i \Delta_{\mu_i} = {\mu^\prime}_i
\end{displaymath} (5.13)

Independent of the $\Sigma_1,\ldots,\Sigma_k$ we are free to choose the target means $\mu_1,\ldots,\mu_k$ so that the $\Delta_{\mu_i}$ have arbitrary values. In particular, we will set them so that $\Delta_{\mu_i} = (C\; C \ldots C)^t$ where $C > 1$ is a constant.

Now focus on target mixtures such that $\Delta_{\Sigma_i} \triangleq \Sigma_i^{-1}- \Sigma_j^{-1}$ has value $\alpha I$ where $\alpha \rightarrow \infty$. Then ${\Sigma^\prime}_i^{-1}= \alpha I + {\Sigma^\prime}_j^{-1}$. So independent of the choice of the particular $\Sigma_1,\ldots,\Sigma_k$ and ${\Sigma^\prime}_j$, each ${\Sigma^\prime}_i^{-1}$, where $i \neq j$, will approach $\alpha I$. Their inverses therefore approach the diagonal matrix $(1/\alpha) I$.

As $\alpha \rightarrow \infty$, we adjust the target means as necessary so as to maintain $\Delta_{\mu_i} = (C\; C \ldots C)^t$. Then from Eq. 5.13 it is apparent that ${\mu^\prime}_i$ is very nearly a scaled-up copy of the diagonal of ${\Sigma^\prime}_i$. Since the off-diagonal elements approach zero, it must eventually be the case that $\Vert{\Sigma^\prime}_i\Vert < \Vert{\mu^\prime}_i\Vert$. Also, since the sample set is finite, it must eventually be that both are smaller than, say, $1/100$th of the smallest distance between sample points. When this happens, the distance from ${\mu^\prime}_i$ to the nearest sample will be at least $\Vert{\mu^\prime}_i\Vert$.

Now the covariance ${\Sigma^\prime}_i$ is a convex combination of matrices of the form $(s_\ell - {\mu^\prime}_i)(s_\ell - {\mu^\prime}_i)^t$ - and their diagonals are nonnegative. The norm of ${\Sigma^\prime}_i$ is at least the norm of its diagonal, and the diagonal reflects the distances from the sample points to ${\mu^\prime}_i$. From this it follows that $\Vert{\Sigma^\prime}_i\Vert \geq \Vert{\mu^\prime}_i\Vert$, which presents a contradiction, whence EM-style emphasis is not universal. Our arguments above establish:

Theorem 7   In the setting of theorem 6, altered after the fashion of EM so that each covariance $\Sigma_i^\prime$ is given by:


\begin{displaymath}
\Sigma_i^\prime = \frac{1}{\sum_{j=1}^n \gamma_{i,j}}
\sum_{j=1}^{n} \gamma_{i,j} (s_j-\mu_i^\prime)(s_j-\mu_i^\prime)^t
\end{displaymath}

there exist target mixtures which cannot be generated by emphasis.

We have seen that no number of sample points makes EM-style emphasis universal. But given enough points, we can certainly identify a particular class equivalence class, since all functions involved are analytic with a finite number of parameters. So given enough sample points, a reparameterization must be universal merely in order to match a target's class behavior at the sample points alone. Now we are interested in reparameterizations which apply given an unlimited number of points. Therefore, unlike the modified emphasis reparameterization introduced in the previous section, EM-style reparameterization is simply not expressive enough to safely reparameterize a posteriori optimization problems.

Concluding Remarks

The a posteriori degeneracy we clarify in section 5.2 for normal mixtures must be added to the list of remarkable characteristics of Gaussians. We have not considered analogues of theorem 5 for other densities.

The reparameterization of section 5.3 is of mathematical interest, and lends some credibility to approaches which attempt to maximize a posteriori probability by adjusting weights on the available samples - perhaps according to some error measure. An examination of reparameterization as a primary optimization technique represents an interesting area for future work. We must however caution that while our proofs are constructive, we have not considered the matter of numerical sensitivity. Indeed, if one attempts to emulate a mixture using means far displaced from their natural locations, the mixing parameters become quite small. In these cases floating point underflow is a virtual certainty.

One should not interpret section 5.4 to say that EM-style emphasis will not work in practice for many problems. It merely exposes its theoretical limitations.

Acknowledgments

We thank Leonid Gurvits for several helpful discussions - and in particular for pointing out the simple factorization in the proof of proposition 5, and for simplifying our proof of proposition 6. We also thank Joe Kilian for several helpful discussions which contributed to our understanding of degeneracy and class equivalence.



Peter N. Yianilos
2002-06-24