Towards EM-style Algorithms for a posteriori Optimization of Normal Mixtures

Expectation maximization (EM) provides a simple and elegant approach to the problem of optimizing the parameters of a normal mixture on an unlabeled dataset. To accomplish this, EM iteratively reweights the elements of the dataset until a locally optimal normal mixture is obtained. This paper explores the intriguing question of whether such an EM-style algorithm exists for the related and apparently more difficult problem of finding a normal mixture that maximizes the a posteriori class probabilities of a labeled dataset.

We expose a fundamental degeneracy in the relationship between a normal mixture and the a posteriori class probability functions that it induces, and use this degeneracy to prove that reweighting a dataset can almost always give rise to a normal mixture that exhibits any desired class function behavior. This establishes that EM-style approaches are sufficiently expressive for a posteriori optimization problems and opens the way to the design of new algorithms for them.

Keywords:Expectation Maximization (EM), Normal Mixtures, Gaussian Mixtures, Supervised Learning, Maximize Mutual Information (MMI).

Introduction

A finite normal mixture is a stochastic combination of normal densities. That is, a probability density function

on ${\mathbb{R}}^d$ of the form $p(x) = \sum_{i=1}^k m_i N_{\Sigma_i,\mu_i}(x)$ where $m_1,\ldots,m_k \geq 0$ with $\sum_{i=1}^k m_i = 1$ , and $N_{\Sigma_i,\mu_i}(x)$ denotes the normal density with mean $\mu_i$ and covariance $\Sigma_i$ . Each constituent normal density is referred to as a component of the mixture, and $m_1,\ldots,m_k$ are the mixing coefficients. Normal mixtures have proven useful in several areas including pattern recognition [4] and speech recognition [6,2,5] - along with vector quantization and many others.

The problem of finding a

-component normal mixture

that maximizes the likelihood $\prod_i p(s_i\vert M)$ . of an unlabeled dataset $s_1,\ldots,s_n$ may be approached using the well-known expectation maximization (EM) algorithm [1,3,7]. EM iteratively re-weights each sample's membership in each of the

mixture components by the posteriori probability of the components given the sample.

Normal mixtures are also applied to pattern classification problems. For classification problems, each mixture component may be thought of as corresponding to a pattern class. Given a class label $\omega(s_i)$ for each element

in our dataset, we would like to maximize $\prod_i p(\omega(s_i)\vert s_i,M)$ . Here the goal is to optimize the mixture's a posteriori performance, that is to predict the correct labels, not model the observation vectors themselves. This is sometimes called the maximum mutual information (MMI) criterion in the speech recognition literature [2] and may be viewed as probabilistic supervised learning.

A naive algorithm for this problem segregates the data elements by class, constructs the maximum-likelihood normal density for each class, and combines the resulting densities to form a mixture. This approach can be far from optimal, as illustrated by figure 1. Part (a) of the figure depicts the maximum-likelihood normal densities corresponding to the pair of points in each class. These densities are specified by the unweighted sample mean and unweighted sample covariance of each class in isolation. Although the resulting mixture maximizes the probability of the samples, it does not maximize the conditional probability of the classes given the samples. For example, the density shown in part (b) assigns greater conditional probability to the classes given the samples, although it assigns less probability to the samples themselves. To accomplish this, increased weight is assigned to the outermost two points in part (b). In this example the classes neatly divide and the optimal densities have zero-variance corresponding to placing all of the emphasis on a single element of each class. We view these weights as placing emphasis on the dataset, resulting in a modified collection of normal densities.

**Figure 1:** A simple one-dimensional classification problem. An even mixture formed with the two densities of part (a), each an optimal normal model of a single class, does not perform as well as a mixture formed with the densities of part (b) -- where the performance measure is the probability with which the mixture predicts the correct label. The smaller variance densities perform better because their tails approach zero more rapidly so that less probability is ascribed to the incorrect class.
$\begin{figure}\centering \ \psfig{figure=figs/fourpts.fig.ps,width=3.0in} \end{figure}$

One might certainly imagine one based perhaps on the common machine learning heuristic where one increases the weight of a mistake in order to improve classification accuracy - a technique sometimes referred to as corrective training. But it is not at all clear that emphasis algorithms have the expressive power to discover the solution. The convex reweighting performed by EM cannot, for example, generate a mean vector outside of the dataset's convex hull.

This paper takes the first step towards such an algorithm by bringing the expressive power of emphasis schemes more clearly into focus. Section 2 defines two normal mixtures to be class equivalent if they induce identical a posteriori class functions, i.e. perform identically as probabilistic classifiers. It is shown there that the relationship between mixtures and their class behavior is highly degenerate, and theorem 1 characterizes this degeneracy in a way that is useful for our objective. As a positive result of this degeneracy one can search the entire space of class functions without considering all possible mixtures. So to solve the a posteriori maximization problem above, it suffices to express any normal mixture that induces optimal class functions. We are then led to ask:

A positive answer to this question is given by theorem 2 of section 3. It states that one can almost always exactly match the a posteriori behavior of an arbitrary mixture by an emphasis method that slightly modifies the EM approach. Therefore any optimization problem that depends only on a mixture's induced class functions can be solved using this form of emphasis.

We conclude our introduction by reviewing EM and presenting our view of it as an emphasis reparameterization scheme.

Let $s_1,\ldots,s_n \in {\mathbb{R}}^d$ be a set of observations. Any set of positive weights $\gamma_1,\ldots,\gamma_n$ , gives rise in a natural way to a normal density. That is, the normal density having mean:

The well known expectation maximization (EM) method may be viewed as a process which accepts as input a normal mixture, and outputs an emphasis parameter set - from which an improved normal mixture arises. This cycle continues as the algorithm climbs towards a local maximum, which is a stationary point of the iteration. The global maximum is also stationary point.

More precisely, Given data values $s_1,\ldots,s_n$ and a normal mixture

, one begins by computing $p(\omega_i\vert s_j,M), \forall 1 \leq i \leq k, 1 \leq j \leq n$ . Conceptually the result is a $k \times n$ table of nonnegative values. The rows of this table are then normalized so that their sum is one. Each row then corresponds to a convex combination of the samples. The entries along a row are thought of as weights attributed to the sample. Each improved mean vector $\mu_i$ is merely the corresponding weighted average (convex combination) of the sample vectors. Each improved covariance is the sample convex combination of outer products $(s_j - \mu_i) (s_j - \mu_i)^t$ . The improved mixing parameters are obtained by normalizing the vector consisting of the table row weights prior to their normalization. This process is repeated in an elegant cycle, i.e. each table induces a mixture that gives rise to a new table, and so on.

From our viewpoint, what is interesting here is that it is possible to search for an optimal normal mixture, by searching over the space of values for the

emphasis parameters - rather than using the canonical parameterization consisting of the unknown mean vectors, covariance matrices, and mixing coefficients.

Only a locally optimal solution is found through EM but it is important to realize that the global optimum is a stationary point of the iteration. This means that it can be expressed via emphasis, i.e. that there exists a table which induces the globally optimal mixture. This is interesting because not all mixtures can be induced from the table. In particular, as noted earlier, the induced means must lie within the convex hull of the sample set, and the covariances are similarly constrained.

The Strange a Posteriori degeneracy of Normal Mixtures

This section illuminates certain fundamental aspects of the nature of normal mixtures. Thinking of each mixture component as a class, we focus on the corresponding a posteriori class probability functions. It is shown that the relationship between these functions and the mixture's parameters, is highly degenerate - and that the precise nature of this degeneracy leads to somewhat unusual and counter-intuitive behavior. Even complete knowledge of a mixture's a posteriori class behavior, reveals essentially nothing of its absolute nature, i.e. mean locations and covariance norms. Consequently a mixture whose means are located in a small ball anywhere in space, can project arbitrary class structure everywhere in space.

**Figure:** A depiction of an even mixture of two normal densities in ${\mathbb{R}}^2$ , each with identity covariance. The two classes are separated by a linear decision boundary.
$\begin{figure}\centering \ \psfig{figure=figs/decide.fig.ps,width=3.0in} \end{figure}$

A convenient visualization of class functions, focuses on decision boundaries, i.e. the surfaces along which classification is ambiguous. Imagery like ours in figure 2, and pages 28-31 of [4], suggest an intuitive relationship between mixture component locations, and the resulting a posteriori class structure and decision surfaces. One imagines each mean to be asserting ownership over some volume of space surrounding it. This view is however far from the truth and our theorem 1 reveals that the a posteriori class behavior of normal mixtures is far stranger and counter-intuitive than is generally understood. It shows that knowledge of a mixture's a posteriori class structure (which may be thought of as decision surfaces), tells us essentially nothing about the absolute location of the mixture's means - or the absolute nature of its covariances. One can easily imagine a normal mixture, and picture its corresponding class structure. Now if one is instead given only this class structure, the theorem says that infinitely many normal mixtures are consistent with it - and in particular that the means of such a mixture can be located within an arbitrarily small ball, located anywhere in space.

Despite the popularity and significant role of normal mixtures in statistical modeling and pattern recognition, the precise nature of this fascinating and highly degenerate relationship between class functions and mixtures has received little attention. Part of the reason may be that the simple view in which component means dominate some portion of the space around them is exactly the case when all covariances have identical determinant, and uniform mixing coefficients are used. The most common example of this setting consists of nearest-neighbor Euclidean clustering which corresponds to the identity covariance case. In practice, if the determinants are somewhat comparable, and the mixing coefficients nearly uniform, the simple view is almost always valid. But in many cases, such as when mixture parameters are estimated using an optimization procedure, no such constraint is imposed, and the means, covariance matrices, and mixing coefficients are free to assume any values. Here the full range of strange behavior may be expected.

We now turn to the section's main mathematical development. Examples and discussion follow and the section ends with a technical convergence-rate result that is needed to establish the next sections' main result.

Definition 1 A finite mixture is a probability function on a measure space ${\cal X}$ , arising from

conditional probability functions (components) as follows:

$\begin{displaymath} p(x) = \sum_{i=1}^k p(x,\omega_i) = \sum_{i=1}^k p(x\vert\omega_i) p(\omega_i) \end{displaymath}$

where the constants $\{p(\omega_i)\}$ are referred to as the mixing coefficients. Any such mixture induces a posteriori class functions:

$\begin{displaymath} p(\omega_i\vert x) = \frac{p(x\vert\omega_i) p(\omega_i)}{p(x)} \end{displaymath}$

Two finite mixtures are class-equivalent if they induce the same class functions.

The class functions are not independent since they are constrained by: $\sum_i p(\omega_i\vert x) = 1$ . We begin with a simple proposition, which allows us in what follows to focus on ratios of conditional probabilities, and ignore constant factors. This will be important since the nonconstant portion of the ratio of two normal densities is a rather simple object.

Proposition 1 Let

be a k-component mixture, and $p^\prime(x\vert\omega_1),\ldots, p^\prime(x\vert\omega_k)$ be a collection of strictly positive conditional probability functions. If for some

, there exist constants $C_{i,j}$ such that $\forall i \neq j$ and $x \in {\cal X}$ :

$\begin{displaymath} \frac{p(x\vert\omega_i)}{p(x\vert\omega_j)} = C_{i,j} \cdot \frac{p^\prime(x\vert\omega_i)}{p^\prime(x\vert\omega_j)} \end{displaymath}$

(3)

then there exist mixing coefficients $p^\prime(\omega_1),\ldots, p^\prime(\omega_k)$ such that the resulting mixture $p^\prime(x)$ is class-equivalent to .

proof: We begin by giving a formula for mixing coefficients $p^\prime(\omega_1),\ldots, p^\prime(\omega_1)$ , such that:

Relaxing for the moment the restriction $\sum_{\ell=1}^k p(\omega_\ell) = 1$ , observe that if $p(\omega_j)$ were

, then setting $p(\omega_i) = k_{i,j}, i \neq j$ , would satisfy the equation. Normalization then yields a solution which does sum to

The approach above works so long as $p(\omega_j) \neq 0$ . If it is zero, then

may be treated as a

component mixture, and the proposition follows by induction. Till now

has been fixed and Eq. 5 is established for $i \neq j$ . This equation is however easily extended to any pair $i,\ell$ of indices. Let

denote $p(x\vert\omega_i)p(\omega_i)$ and $f^\prime$ denote $p^\prime (x\vert\omega_i)p^\prime (\omega_i)$ . Then:

and a corresponding expression for $p^\prime (\omega_i\vert x)$ in terms of $\{f^\prime _\ell\}$ . We have already seen that all ratios $f_i/f_\ell = f^\prime _i/f^\prime _\ell$ , so $\sum_{\ell \neq i} f_\ell/f_i = \sum_{\ell \neq i} f^\prime _\ell/f^\prime _i$ whence $p(\omega_i\vert x) = p^\prime (\omega_i\vert x)$ and we are done. $\;\Box$

We now specialize our discussion to mixtures whose components are normal densities.

Definition 2 A

-dimensional

-component normal mixture, is a finite mixture of multivariate normal densities:

$\begin{displaymath} N_{\mu,\Sigma}(x) \triangleq \frac{1}{(2\pi)^{d/2} \Sigma^{1/2}} \cdot e^{-\frac{1}{2} (x-\mu)^t \Sigma^{-1} (x-\mu)} \end{displaymath}$

(6)

The mixture's parameters are then $\Phi = \{\Sigma_1, \ldots, \Sigma_k, \mu_1, \ldots, \mu_k, p(\omega_1), \ldots, p(\omega_k)\}$ .

The following theorem exposes the large degeneracy in the relationship between normal mixtures, and their induced class functions.

Theorem 1 Let

be a

-dimensional normal mixture with

components. For any $x \in {\mathbb{R}}^d$ and $\epsilon > 0$ , there exists a

-dimensional

-component normal mixture $p^\prime$ , such that for $1 \leq i \leq k$ :

$\Vert{\mu^\prime}_i - x\Vert < \epsilon$
$\Vert{\Sigma^\prime}_i\Vert _2 < \epsilon$
$p^\prime$ and are class-equivalent

where $\Vert{\Sigma^\prime}_i\Vert _2$ refers to the Frobenius/Euclidean matrix norm.

proof: Begin by selecting some distinguished component

. By proposition 1 we have only to produce ${\Sigma^\prime}_1,\ldots,{\Sigma^\prime}_k$ and ${\mu^\prime}_1,\ldots,{\mu^\prime}_k$ such that the ratio conditions of the proposition are satisfied. Since constant factors do not matter, we ignore several portions of the definition of a normal density Eq. 6. Namely, the leading constant and the constant portion of the exponent once multiplied out. The result is that:

The proportional ratio condition of proposition 1 then leads immediately to the following necessary and sufficient conditions:

We set ${\mu^\prime}_j = x$ and begin by sketching the rest of the proof. The constraints are satisfied by choosing ${\Sigma^\prime}_j^{-1}$ so that each of its eigenvalues is large. Each resulting ${\Sigma^\prime}_i^{-1}$ must then be positive definite and will also have large eigenvalues. Both ${\Sigma^\prime}_j$ and ${\Sigma^\prime}_i$ then have small norm, satisfying the requirements of the theorem. In this process, it is only ${\Sigma^\prime}_j^{-1}$ we are free to choose. Each choice determines the ${\Sigma^\prime}_i$ , and ${\mu^\prime}_i$ (since we have already set ${\mu^\prime}_j = x$ ). In the limit, as we choose ${\Sigma^\prime}_j^{-1}$ with increasingly large eigenvalues, $\Vert{\mu^\prime}_i - {\mu^\prime}_j\Vert = \Vert{\mu^\prime}_i - x\Vert$ approaches zero whence we can satisfy the theorem's other condition.

Denote by $\overline{\lambda}(A)$ the largest eigenvalue of positive definite matrix

, and by $\underline{\lambda}(A)$ the smallest. For matrices

, it then follows easily from the Cauchy-Schwartz inequality that $\underline{\lambda}(A+B) \geq \underline{\lambda}(A) - \overline{\lambda}(B)$ , by writing $\Vert[(A+B)+(-B)]u\Vert \leq \Vert(A+B)u\Vert + \Vert-Bu\Vert$ where

denotes any unit length vector. Now the first constraint in Eq. 7 may be written:

The parenthesized term is constant since we are given both $\Sigma_i^{-1}$ and $\Sigma_j^{-1}$ , and by subadditivity of symmetric matrix spectral radii, does not exceed their sum $\overline{\lambda}(\Sigma_i^{-1}) + \overline{\lambda}(\Sigma_j^{-1})$ . We can easily choose ${\Sigma^\prime}_j^{-1}$ such that $\underline{\lambda}({\Sigma^\prime}_j^{-1})$ is arbitrarily large, e.g. $c\cdot I$ where

is a large positive constant. It follows then that $\underline{\lambda}({\Sigma^\prime}_i^{-1})$ will also be arbitrarily large, ensuring that it is positive definite.

Next recall that the column norms of a matrix

cannot exceed its spectral radius (operate on each member of the canonical basis). In our positive definite setting each is then bounded above by $\overline{\lambda}(A)$ , so that $\Vert A\Vert _2 \leq \sqrt{d} \overline{\lambda}(A)$ . Choosing the smallest eigenvalue of ${\Sigma^\prime}_j^{-1}$ and ${\Sigma^\prime}_i^{-1}$ to be arbitrarily large, forces $\overline{\lambda}({\Sigma^\prime}_j)$ and $\overline{\lambda}({\Sigma^\prime}_i)$ to be arbitrarily small - satisfying the theorem's first condition.

We set ${\mu^\prime}_j$ to be equal to

from the statement of the theorem, and the second constraint in Eq. 7 may be rearranged to yield:

By our earlier discussion, we may force ${\Sigma^\prime}_i$ to be arbitrarily close to zero - forcing the bracketed term in Eq. 9 to approach zero as well. We next show that the first parenthesized term ${\Sigma^\prime}_i {\Sigma^\prime}_j^{-1}$ tends to the identity matrix

as $\underline{\lambda}({\Sigma^\prime}_j^{-1}) \rightarrow \infty$ . This then demonstrates that ${\mu^\prime}_i \rightarrow x$ satisfying the theorem's second condition, and finishes the proof.

It is easy to see that ${\Sigma^\prime}_i {\Sigma^\prime}_j^{-1} \rightarrow I$ if and only if $({\Sigma^\prime}_i {\Sigma^\prime}_j^{-1})^{-1} = {\Sigma^\prime}_j {\Sigma^\prime}_i^{-1} \rightarrow I$ . Using Eq. 8 this becomes:

To recapitulate, any location for ${\mu^\prime}_j$ , and sufficiently large (in the $\underline{\lambda}$ sense) ${\Sigma^\prime}_j^{-1}$ , will give rise using the proof's constructions, to values for the other means, covariances, and mixture coefficients, such that a class-equivalent mixture results. The proof goes on to establish that in the limit, everything is confined as required within an $\epsilon$ -neighborhood.

**Figure 3:** Two ways to induce a single class function. An illustration of the degeneracy in the relationship between normal mixtures, and their induced class functions. Zero mean normal densities G1 and G2 when mixed evenly induce C1, the *a posteriori* probability of G1. Normal densities G1' and G2' located in the right portion of the graph do not have zero means, but may be mixed (in a far from even way) so that C1 results.
$\begin{figure}\centering \ \psfig{figure=figs/cannon.ps,width=3.5in} \end{figure}$

Figure 3 provides a simple one dimensional example of the normal mixture to class function degeneracy. The left pair of Gaussians G1,G2 both have zero mean. The taller one, G1 corresponds to the first class and has variance

, while G2 has variance 1 and corresponds to the second class. When mixed evenly, the a posteriori probability C1 results. Notice that C1 assumes value

where G1 crosses G2. The right pair of Gaussians G1',G2' have means at

and

, and variances of

and

respectively. An even mixture of them would result in a class function very different from C1. But if very little weight ( $\approx 5.74234\times 10^{-5}$ ) is placed on G1', its mode drops under the graph of G2', and surprisingly, the induced class function is exactly C1. The parameters of this new mixture follow immediately from the calculations within the proof of theorem 1.

Figure 4 depicts another two class problem, this time in two dimensions. Here various elements of each class are arranged so that they may be perfectly separated by a circular decision boundary. It is clear that we can create such a boundary by modeling each class with a normal density centered at the origin. We may evenly mix these components and set their covariance matrices to be multiples of the identity matrix; chosen so that the two density's intersection is the desired circle. This is in some sense the canonical solution but infinitely many others are possible. To begin with, it is not necessary that both covariance matrices be multiples of the identity. It suffices that their difference be such a multiple. It is not necessary that their means be located at the origin. Given a choice of covariance matrices, and the location of one mean, a location for the other may be computed which gives rise to the same boundary. Moreover, by choosing the covariances to be nearly zero, corresponding to highly peaked densities, the two means may be located within an arbitrary $\epsilon$ -neighborhood anywhere in ${\mathbb{R}}^2$ . This illustrates the degeneracy by focusing only on the decision boundary, but theorem 1 shows that the family of class functions may be matched everywhere. The situation in figure 4 was first contrived in an attempt to disprove our emphasis reparameterization theorem, the subject to which we next turn. The data points are arranged along only part of the circle so that no convex combination of them, can possibly express the origin. That is, the origin is outside of the convex hull of the dataset. At that time we thought that this would prevent such combinations from leading to a mixture capable of inducing the desired circular decision boundary. The revelation of theorem 1 is that the component means of such a mixture can be located anywhere.

**Figure 4:** If data points of two colors are imagined to hug the inside and outside of a circular decision boundary, one might think that a Gaussian mixture model capable of separating the classes, would necessarily have component means at the center - but this is not the case.
$\begin{figure}\centering \ \psfig{figure=figs/clock.fig.ps,width=2.5in} \end{figure}$

We have seen that all the ${\Sigma^\prime}_i$ converge to ${\Sigma^\prime}_j$ as the latter tends to zero. At the same time, the ${\mu^\prime}_i$ approach ${\mu^\prime}_j$ . The rate of convergence is our next topic. In particular we will show that $\vert{\Sigma^\prime}_i - {\Sigma^\prime}_j\vert \rightarrow 0$ quadratically as ${\Sigma^\prime}_j \rightarrow 0$ . This is formalized in the following proposition which is needed to establish the main result of the next section. We remark that the convergence of each ${\mu^\prime}_i$ to ${\mu^\prime}_j$ is only linear.

proof: The difference $\Sigma_i^{-1} - \Sigma_j^{-1}$ is constant and is denoted by

. Then:

Now clearly $I+CP^{-1}\rightarrow I$ with $\overline{\lambda}(P^{-1})$ and the eigenvalues of $I + CP^{-1}$ approach unity at this rate. So $(I+CP^{-1})^{-1}\rightarrow I$ as well - also with $\overline{\lambda}(P^{-1})$ . Hence $\Vert I - (I + CP^{-1})^{-1}\Vert = O(\overline{\lambda}(P^{-1}))$ in Eq. 10. But $\Vert P^{-1}\Vert = O(\overline{\lambda}(P^{-1}))$ as well, so $\Vert(P+C)^{-1}- P^{-1}\Vert = O(\overline{\lambda}^2(P^{-1}))$ . $\;\Box$

Emphasis Reparameterization

The question ``Can emphasis induce a mixture class-equivalent to an arbitrary one?'' posed at the beginning of this paper is reasonable since from theorem 1 we know that the means of such a mixture can lie within an arbitrarily small ball located anywhere in space. So confinement to the convex hull of the samples is not necessarily a restriction. Likewise the covariances can be chosen to have arbitrarily small Frobenius norm.

The main result of this section is a qualified positive answer to this question. It states that a modified form of emphasis can, for almost all sufficiently large observation sets, induce a mixture which is class equivalent to an arbitrary one.

Modifications are needed because even though the means and covariances may be chosen with a great deal of freedom, they are still highly constrained by the conditions of Eq. 7 - and it is sometimes impossible to generate a mixture which has satisfies these constraints by the form of emphasis employed by EM. The next section expands on the limits of strict EM-style emphasis.

where the leading normalization has been removed. This allows the scale of the covariance to be adjusted without affecting the mean.

Before stating and proving our reparameterization result, a particular matrix is defined whose nonsingularity is a condition of the theorem.

Condition 1 Denote by

the column vector formed by concatenating

and

, and let

(the number of samples) equal

. Next form square matrix

by combining these columns. For example, if

we have:

$\begin{displaymath} S \triangleq \left( \begin{array}{ccccc} s_{\scriptscript... ...yle{4,2}} & s_{\scriptscriptstyle{5,2}} \end{array} \right) \end{displaymath}$

Our condition is that be nonsingular.

It is not hard to show that this condition is almost always satisfied, a fact formalized by:

proof: Since the matrix consists of monomials of distinct structure, no non-zero linear combination (over the ring of polynomials) of its columns or rows vanishes. So its determinant, as a polynomial, is not identically zero. But the roots of a non-zero polynomial form a set of zero measure. $\;\Box$

Next we simplify our setting, adjust notation, and establish one more proposition before turning to the theorem. Without loss of generality, we may assume that our samples have mean zero. To see this, first let

denote some target mean, where we seek stochastic $\gamma$ values such that $\sum \gamma_i s_i = t$ . We may equivalently solve $\sum \gamma_i s_i^\prime = t^\prime$ , where $s_i^\prime \triangleq s_i - E[S]$ and $t^\prime = t-E[S]$ , since $\sum \gamma_i s_i^\prime = (\sum \gamma_i s_i) - E[S]$ . The problem of expressing a target covariance is unchanged by coordinate translation, since $(s_i - t) = (s_i^\prime - t^\prime)$ . Notice that this is true even if the $\gamma$ are not stochastic. So in what follows, we will drop the prime notation and simply assume that $E[\{s_i\}] = 0$ .

The mixture we generate will have all of its means located near zero, the mean of the sample set. The distinguished mixture component, identified with index

in the previous section, is located exactly at the origin. The location of the other mean vectors, are thought of as small displacements $\delta$ from the origin. We then write $s_i(\delta)$ to mean $s_i - \delta$ , and the

matrix is then parameterized by $\delta$ in the obvious way, and denoted $S(\delta)$ . Note that

is just

as originally defined.

proof: Immediate from the continuity of the $\delta$ parameterization of

. $\;\Box$

Each value of $\delta$ corresponds to a choice of mean vector, and gives rise to a linear mapping $S(\delta)$ . That is, for each choice of $\delta$ , a linear map arises. The proposition tells us that so long as our mixture's means are kept within distance

of the origin, all associated linear mappings are invertible.

The utility of $S(\delta)$ is that it allows us to simultaneously express the objectives of obtaining by emphasis a specified mean vector, and covariance matrix. If $\Gamma$ is a nonnegative emphasis vector, $\delta_t$ is the targeted mean vector, and $\Sigma_t$ the desired covariance matrix, then we must solve the system $S(\delta_t) \Gamma = \Sigma_t : 0$ . Here $\Sigma_t$ is regarded as a vector which is concatenated (denoted ``:'') with the d-dimensional zero vector.

The $\Gamma$ above is nonnegative, but not necessarily stochastic. This may seem to be inconsistent with our definition of the generation of a mean vector by emphasis, because of the normalization it includes. But it is not since $\sum \gamma_i (s_i - \delta) = 0$ is equivalent to $(1/\sum \gamma_i) \sum \gamma_i = \delta$ , which is exactly our definition.

Theorem 2 Given any

-dimensional normal mixture

with

components, and $s_1,\ldots,s_n \in {\mathbb{R}}^d$ such that $n \geq d(d+3)/2$ with some subset of size

satisfying condition 1, then there exists a $k \times n$ table of nonnegative values $\{\gamma_{i,j}\}$ , and nonnegative values $m_1,\ldots,m_{k-1}$ with $\sum m_k \leq 1$ , such that the normal mixture $M^\prime$ generated as described below, is class-equivalent to

The mixing parameters of $M^\prime$ are $m_1,\ldots,m_{k-1},1-\sum_{j=1}^{k-1} m_j$ .
Each mean $\mu_i^\prime$ within $M^\prime$ is given by:

$\begin{displaymath} \mu_i = \frac{1}{\sum_{j=1}^n \gamma_{i,j}} \sum_{j=1}^{n} \gamma_{i,j} s_j \end{displaymath}$
Each covariance $\Sigma_i^\prime$ within $M^\prime$ is given by:

$\begin{displaymath} \Sigma_i = \sum_{j=1}^{n} \gamma_{i,j} (s_j-\mu_i)(s_j-\mu_i)^t \end{displaymath}$

Also, the number of parameters may be reduced to , matching the complexity of the canonical parameterization, by replacing the top $\gamma$ -table row, by a single new parameter $\alpha$ .

proof: Choose some subset of the $\{s_i\}$ of size exactly

that satisfies condition 1. We disregard the other elements entirely, and therefore simply assume that

. As argued earlier, we may also assume without loss of generality that the $\{s_i\}$ have mean zero. Then the distinguished mixture component from theorem 1, with parameters $({\mu^\prime}_j, {\Sigma^\prime}_j)$ , is chosen as follows. Its mean ${\mu^\prime}_j$ is the origin, and its covariance ${\Sigma^\prime}_j$ is proportional to the average of the element self-outer-products, i.e.:

This may be thought of as choosing the first row of the $\gamma$ table to be

, and introducing a new scale parameter $\alpha$ . The table's top row elements are no longer parameters, so the total becomes

table parameters, plus

mixing parameters, plus $\alpha$ - matching the count in the statement of the theorem. As described in the proof of theorem 1, the choice of ${\mu^\prime}_j$ is arbitrary, but ${\Sigma^\prime}_j$ may need to be scaled down to preserve the positive nature of all matrices. The number of resulting parameters in a direct parameterization (i.e. consisting of means, covariances, and mixing parameters), then matches our count.

As $\alpha \rightarrow 0$ we know that the ${\mu^\prime}_i \rightarrow {\mu^\prime}_j = 0$ . Our first step is to choose $\alpha$ sufficiently small, so that the largest $\Vert{\mu^\prime}_i\Vert$ is smaller than the

of proposition 4. Next, $\alpha$ is possibly reduced further until each covariance matrix ${\Sigma^\prime}_i$ arising from Eq. 4 is positive.

Each ${\mu^\prime}$ corresponds to a $\delta$ value in our definition of $S(\delta)$ . The reference mean, located at the origin, corresponds to $\delta=0$ , and by our construction, arises from weighting the $s_\ell$ by non-negative $\gamma$ values - in this case uniform ones. Associated with each ${\mu^\prime}_i$ there is also a ${\Sigma^\prime}_i$ . Recall from our earlier discussion, that our focus is on the equation $S(\mu_i) \Gamma = \Sigma_i : 0$ . Since $S(\delta)$ is non-singular for each of the ${\mu^\prime}_i$ , we know that some vector of $\gamma$ values satisfies this equation. We have only to show that solutions consisting only of nonnegative values can be found.

We will say than a point is representable under some $S(\delta)$ , if its inverse image under this mapping consists only of non-negative values. There exists a representable open neighborhood about $\alpha ({\Sigma^\prime}_j : 0)$ under

, since the inverse image of this point² is the constant vector $\alpha/n$ , and consists of strictly positive values.

Within our bounds on $\delta$ , it uniform-continuously³ parameterizes $S(\delta)$ . Hence there exist values $c^\prime$ and $\epsilon$ , such that for all $\delta$ with $\Vert\delta\Vert \leq c^\prime$ , the open ball about $\alpha ({\Sigma^\prime}_j : 0)$ , with radius $\epsilon$ , is representable under $S(\delta)$ . To avoid introducing another symbol, let $\epsilon$ now denote the maximum such radius.⁴

Next notice that the space of representable points is convex and includes the origin. So representability of $\alpha ({\Sigma^\prime}_j : 0)$ implies representability for all $\alpha$ values. Now if necessary, reduce $\alpha$ further so that all the ${\mu^\prime}$ are within $c^\prime$ of the origin.

Subspace

is the portion of the range, representable under any $S(\delta)$ such that $\delta \leq c^\prime$ .

About $\alpha ({\Sigma^\prime}_j : 0)$ we have seen that there is a representable ball of radius $\epsilon$ which touches the boundary of

. We now denote this radius by $\epsilon(\alpha)$ since we are about to consider its behavior as $\alpha \rightarrow 0$ . By our earlier comments,

includes $\alpha ({\Sigma^\prime}_j : 0)$ , and all proportional vectors, in particular as $\alpha \rightarrow 0$ .

The geometric observation key to our proof is that $\epsilon(\alpha)$ can shrink only as fast as $\alpha$ itself, since this value represents the shortest distance to some point set, in this case

. We imagine

to be a tunnel, leading to the origin while constricting, with the boundary of

forming the tunnel's wall.

Focus now on some ${\mu^\prime}_i$ . While there is a corresponding representable ball about $\alpha ({\Sigma^\prime}_j : 0)$ of radius $\epsilon(\alpha)$ , there is no guarantee that ${\Sigma^\prime}_i$ is within it. As $\alpha \rightarrow 0$ , we have seen that the radius of this ball shrinks linearly with $\alpha$ . But the distance of ${\Sigma^\prime}_i$ from ${\Sigma^\prime}_j$ by proposition 2 shrinks quadratically with $\alpha$ , whence eventually, i.e. for small enough $\alpha$ , ${\Sigma^\prime}_i$ becomes representable. Then $\alpha$ is reduced as necessary for for each component of the mixture, so that afterwards, every ${\Sigma^\prime}: 0$ is representable. $\;\Box$

Other emphasis theorems are possible. In particular it is much easier to establish a similar result in which the weights may be either positive or negative because then the only requirement is that $S(\delta)$ be nonsingular.

The Limitations of Strict EM-style Emphasis

In the previous section we saw that using a particular form of emphasis reparameterization, the a posteriori behavior of any mixture could be matched. We say that such a sample set and emphasis method is universal. This section is by contrast essentially negative, and begins with an example demonstrating that even in one dimension, strict EM-style emphasis is not universal.

Our example is in ${\mathbb{R}}^1$ , and includes two observation points located at

and

. The leading normalization factors in Eq. 2,1 amount to enforcing convex combination, so there is only one degree of freedom given our two element observation set. If $\gamma$ denotes the weight of the first point, then the second has weight $1-\gamma$ . The induced mean is just $\gamma$ and the induced variance $\gamma(1-\gamma)$ . Our objective is to generate a two element mixture which is class equivalent to an arbitrary one, so two emphasis parameters $\gamma_1$ and $\gamma_2$ are needed. The two constant differences to be matched (from Eq. 7) are denoted $\Delta_\Sigma$ and $\Delta_\mu$ . That these are independent and unconstrained characteristics an equivalence class follows from the observation that we may, without loss of generality, set one of the target means to zero. The constraints then become:

Choose $\Delta_\mu > 0$ . From our second constraint we have $\gamma_1 = 1 - 1/[\Delta_mu + 1/(1-\gamma_2)]$ from which it follows that $\gamma_1 = \gamma_2$ only when they both are

. But in this case $\Delta_\mu = 0$ , contradicting our choice. So $\gamma_1 \neq \gamma_2$ . Now when $\gamma_2 = 0$ , $\gamma_1 > 0$ , so here $\gamma_2 < \gamma_1$ . Because they are never equal, and their relationship is continuous, this inequality must hold for any pair satisfying the constraints. But it is easily verified that $\Delta_\Sigma - \Delta_\mu = 1/\gamma_1 - 1/\gamma_2$ whence this quantity is negative. So no pair $\gamma_1,\gamma_2$ exists which satisfies the constraints given say $\Delta_\mu = 1$ and $\Delta_\Sigma = 2$ .

We remark that if instead, one is interested in matching the a posteriori class behavior of a mixture at only the given sample points, then the example's argument does not apply. Indeed it may be verified that this less stringent requirement can be satisfied. This is not entirely surprising since given exactly

sample points, one has three free parameters, and must match only two. This is interesting, and we conjecture that it is true more generally for

and

- so long as

. In any event, as a result of our arguments at the end of this section, it cannot be true for arbitrarily large

. Since we are interested in reparameterizing problems with arbitrarily many samples, we will not consider the matter further in this paper.

The example above applies to dimension

only, and a particular set of sample points. Moreover, its analytical approach does not generalize easily. We now show that in the general case strict EM-style emphasis is not universal.

Without loss of generality we will assume ${\mu^\prime}_j=0$ - since otherwise, the problem, and any solution thereto, can be translated appropriately. We may also assume the sample points are distinct. Eq. 7 becomes:

where $\Delta_{\mu_i} \triangleq \Sigma_i^{-1}\mu_i - \Sigma_j^{-1}\mu_j$ . Our focus is on the simple rearrangement of Eq. 12:

Independent of the ${\Sigma^\prime}_1,\ldots,{\Sigma^\prime}_k$ we are free to set ${\mu^\prime}_1,\ldots,{\mu^\prime}_k$ so that the $\Delta_{\mu_i}$ have arbitrary values. In particular, we will set them so that $\Delta_{\mu_i} = (C C \ldots C)^t$ where

is a constant.

Now focus on target mixtures such that $\Delta_{\Sigma_i} \triangleq \Sigma_i^{-1}- \Sigma_j^{-1}$ has value $\alpha I$ where $\alpha \rightarrow \infty$ . Then ${\Sigma^\prime}_i^{-1}= \alpha I + {\Sigma^\prime}_j^{-1}$ . So independent of the choice of particular $\Sigma_1,\ldots,\Sigma_k$ , and ${\Sigma^\prime}_j$ , each ${\Sigma^\prime}_i^{-1}$ , where $i \neq j$ , will approach $\alpha I$ . There inverses therefore approach the diagonal matrix $1/\alpha I$ .

As $\alpha \rightarrow \infty$ , we adjust means as necessary so as to maintain $\Delta_{\mu_i} = (C C \ldots C)^t$ . Then from Eq. 13 it is apparent that ${\mu^\prime}_i$ is very nearly a scaled up copy of the diagonal of ${\Sigma^\prime}_i$ . Since the off diagonal elements approach zero, it must eventually be the case that $\Vert{\Sigma^\prime}_i\Vert < \Vert{\mu^\prime}_i\Vert$ . Also, since the sample set is finite, it must eventually be that both are smaller than say

th of the smallest distance between sample points. When this happens, the distance from ${\mu^\prime}_i$ to the nearest sample will be at least $\Vert{\mu^\prime}_i\Vert$ .

Now the covariance ${\Sigma^\prime}_i$ is a convex combination of matrices of the form $(s_\ell - {\mu^\prime}_i)(s_\ell - {\mu^\prime}_i)^t$ - and the diagonals are nonnegative. The norm of ${\Sigma^\prime}_i$ is at least equal to the norm of its diagonal, but the diagonal is just the distance from the sample point to ${\mu^\prime}_i$ . From this it follows that $\Vert{\Sigma^\prime}_i\Vert \geq \Vert{\mu^\prime}_i\Vert$ which presents a contradiction, whence strict EM-style emphasis is not universal. Our arguments above establish:

Theorem 3 In the setting of theorem 2, altered after the fashion of EM so that each covariance $\Sigma_i^\prime$ is given by:

$\begin{displaymath} \Sigma_i = \frac{1}{\sum_{j=1}^n \gamma_{i,j}} \sum_{j=1}^{n} \gamma_{i,j} (s_j-\mu_i)(s_j-\mu_i)^t \end{displaymath}$

there exist target mixtures which may not be generated by emphasis.

We have seen that no number of sample points make strict EM-style emphasis universal. But given enough points, we can certainly identify a particular class equivalence-class, since all functions involved are analytic with a finite number of parameters. So given enough sample points, a reparameterization must be universal in order to match a target distribution at the sample points only. Now we are interested in reparameterizations which apply given an unlimited number of points. Therefore, unlike the modified emphasis reparameterization introduced in the previous section, strict EM-style reparameterization is simply not expressive enough to safely reparameterize a posteriori optimization problems.

Concluding Remarks

The a posteriori degeneracy we clarify in section 2 for normal mixtures must be added to the list of interesting characteristics of Gaussians. We have not considered analogues of theorem 1 for other densities but remark that proposition 1 applies more generally. Beta densities, for example, have the property that their ratio is again of the same functional form so that a similar degeneracy arises. It seems likely that results similar in flavor to theorem 1 might be established for this and other related densities.

Another interesting area for future work relates to densities for which there is essentially no degeneracy, i.e. for which the mixture may be identified given its class behavior. A simple example is provided by a mixture of two uniform univariate densities $U_{\theta_1}, U_{\theta_2}$ with support $[\theta_1,1+\theta_1]$ and $[\theta_2,1+\theta_2]$ respectively. Here, a form of degeneracy arises only when $\theta_1 = \theta_2$ . The mixture is still identifiable but the mixing coefficients are not. This illustrates the manner in which mixtures of densities of finite support might be used to construct degeneracy-free mixtures. More complex and interesting possibilities exist.

The reparameterization of section 3 is of mathematical interest, and lends some credibility to approaches which attempt to maximize a posteriori probability by adjusting weights on the available samples - perhaps according to some error measure. An examination of reparameterization as a primary optimization technique, represents an interesting area for future work. We must however caution that while our proofs are constructive, we have not considered the matter of numerical sensitivity. Indeed, if one attempts to emulate a mixture using means far displaced from their natural locations, the mixing parameters become quite small. In these cases floating point underflow is a virtual certainty.

One should not interpret section 4 to say that strict EM-style emphasis will not work in practice for many problems. It merely exposes its theoretical limitations.

Acknowledgments

We thank Leonid Gurvits for several helpful discussions - and in particular for pointing out the simple factorization in proof of proposition 2, and simplifying our proof of proposition 3. We also thank Joe Kilian for several helpful discussions which contributed to our understanding of degeneracy and class equivalence.

Bibliography

About this document ...

This document was generated using the LaTeX2HTML translator Version 2K.1beta (1.47)

The command line arguments were:
latex2html -split 0 -image_type gif -nonext_page_in_navigation -noprevious_page_in_navigation -up_url ../main.html -up_title 'Publication homepage' -numbered_footnotes ert.tex

$\displaystyle (P+C)^{-1}- P^{-1}$	$\textstyle =$	$\displaystyle [(I + CP^{-1}) P]^{-1}- P^{-1}$
	$\textstyle =$	$\displaystyle P^{-1}[I - (I + CP^{-1})^{-1}]$	(10)

**Towards EM-style Algorithms for a posteriori Optimization of Normal Mixtures**

Abstract: