
A General Decomposition Theorem that Extends the Baum-Welch and Expectation-Maximization Paradigm to Rational Forms

Peter N. Yianilos
July 22, 2001
Netrics, Inc.
pny@netrics.com

(Extended Abstract)

Abstract:

We consider the problem of maximizing certain positive rational functions of a form that includes statistical constructs such as conditional mixture densities and conditional hidden Markov models. The well-known Baum-Welch and expectation-maximization (EM) algorithms do not apply to rational functions and are therefore limited to the simpler maximum-likelihood form of such models.

Our main result is a general decomposition theorem that, like Baum-Welch/EM, breaks up each iteration of the maximization task into independent subproblems that are more easily solved - but applies to rational functions as well. It extends the central inequality of Baum-Welch/EM and the associated high-level algorithms to the rational case, and reduces to the standard inequality and algorithms for simpler problems.

Keywords: Baum-Welch (forward backward algorithm), Expectation Maximization (EM), hidden Markov models (HMM), conditional mixture density estimation, discriminative training, Maximum Mutual Information (MMI) Criterion.

Introduction

Let $(V,E)$ be a directed acyclic graph (DAG) with $n$ vertices $v_1,\ldots,v_n$ and $m$ edges $e_1,\ldots,e_m$, having a single source $v_1$ and sink $v_n$. A nonnegative parameterized weight function is associated with each edge. The value of a source-sink path is the product of the weights along it. The value of the graph is the sum over all source-sink paths, of each path value.
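
As a concrete (if deliberately naive) illustration of these definitions, the following sketch enumerates all source-sink paths of a small DAG and sums their products. The representation used here and in the sketches that follow - vertices numbered $0,\ldots,n-1$ in topological order, edges as (tail, head, weight) triples - is a convention adopted for illustration only, not part of the formal development.

\begin{verbatim}
# Toy representation: vertices 0..n-1 in topological order, 0 the
# source, n-1 the sink; each edge is a (tail, head, weight) triple.

def graph_value_by_paths(n, edges):
    """Sum, over all source-sink paths, of the product of edge
    weights along the path (exponential time; illustration only)."""
    succ = {v: [] for v in range(n)}
    for u, v, w in edges:
        succ[u].append((v, w))

    def from_vertex(u, prod):
        if u == n - 1:                     # reached the sink
            return prod
        return sum(from_vertex(v, prod * w) for v, w in succ[u])

    return from_vertex(0, 1.0)

# Two parallel source-sink paths: 0->1->3 and 0->2->3.
edges = [(0, 1, 0.4), (1, 3, 0.5), (0, 2, 0.6), (2, 3, 0.25)]
print(graph_value_by_paths(4, edges))   # 0.4*0.5 + 0.6*0.25 = 0.35
\end{verbatim}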

We refer to such a setting as a simple product flow, and the NP-complete problem of maximizing the graph's value over the weight-function parameters [21] corresponds to maximum-likelihood parameter estimation for HMMs, mixture densities, and other statistical models. We refer to the ratio of two simple product flows that share the same parameter space as a rational product flow; here the maximization problem corresponds to conditional (discriminative) parameter estimation. That is, a rational product flow consists of two distinct graphs, and its value is the quotient of their values, regarding each as a simple product flow. Typically the numerator graph is a subgraph of the denominator induced by the labels provided in the supervised discriminative setting.

The edge weights are a function of a parameter space $\Psi$ containing variable members as well as fixed ones. In statistical modeling the variables correspond to mixture element probabilities, symbol output probabilities, state transition probabilities, the parameters of a Gaussian density, or other such model components. The fixed parameters correspond to the data input to the estimation procedure. In these models an individual edge weight is determined by a single model component so that it depends on a subset of the total set of variable parameters - and distinct model components share no variables.

Our product flow setting is an abstraction of these models, and an equivalence class on edges associates each edge $e$ with a weight function denoted $w_{c(e)}$, where $c(e)$ gives the equivalence class index, ranging from $1$ to $k$, for edge $e$. Each weight function $w_i$ has corresponding parameters $\Psi_i$ and in general is associated with many edges. In the statistical models we're abstracting, the value of a model component also depends on the edge itself, so our weight function also accepts the index $n(e)$ of edge $e$ as a parameter to capture this dependency. The weight attached to edge $e$ is then written $w_{c(e)}(\Psi_{c(e)}, n(e))$. For brevity we will write just $w_e(\Psi_e)$.

The value of a graph is computed in linear time by visiting vertices in topological order and accumulating partial results along the way. This corresponds to the well-known $\alpha$ computation of hidden Markov model evaluation and reestimation [19,12]. The value $\alpha_i$ recorded at each intermediate vertex $v_i$ corresponds to the sum of all path values from the source to that vertex. The sink $v_n$ is visited last and its value $\alpha_n$ is that of the entire graph.
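
In the toy representation above, the forward pass can be sketched as follows (processing edges in order of increasing tail index suffices because the vertices are numbered topologically):

\begin{verbatim}
def alpha_values(n, edges):
    """Forward (alpha) pass: alpha[i] is the sum of all path values
    from the source to vertex i; alpha[n-1] is the graph value."""
    alpha = [0.0] * n
    alpha[0] = 1.0                     # the empty path to the source
    for u, v, w in sorted(edges):      # increasing tail index
        alpha[v] += alpha[u] * w
    return alpha

print(alpha_values(4, edges)[3])       # 0.35, as with path enumeration
\end{verbatim}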

Considering the DAG with all edges reversed and performing the same algorithm corresponds to the $\beta$ computation of HMM reestimation. Here each vertex records the value of all paths from it to the sink, and the source's value is that of the entire graph.

Let edge $e$ connect vertex $v_r$ with $v_s$ and have weight $w$. Then $\gamma_e$ is defined as $\alpha_r w \beta_s / \alpha_n$. This corresponds to the $\gamma$ computation of HMM reestimation. If the edge weights are interpreted as probabilities of transiting an edge, then $\gamma_e$ is the probability that a random source-sink path passes through edge $e$.
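
Continuing the sketch, the backward pass and the per-edge $\gamma$ values follow directly (assuming, in this toy representation, at most one edge between any pair of vertices):

\begin{verbatim}
def beta_values(n, edges):
    """Backward (beta) pass: beta[i] is the sum of all path values
    from vertex i to the sink; beta[0] is again the graph value."""
    beta = [0.0] * n
    beta[n - 1] = 1.0
    for u, v, w in sorted(edges, reverse=True):  # decreasing tail index
        beta[u] += w * beta[v]
    return beta

def gamma_values(n, edges):
    """gamma_e = alpha_r * w * beta_s / alpha_n for edge e = (r, s)."""
    alpha, beta = alpha_values(n, edges), beta_values(n, edges)
    return {(u, v): alpha[u] * w * beta[v] / alpha[n - 1]
            for u, v, w in edges}

print(gamma_values(4, edges))  # edges leaving the source get 4/7, 3/7
\end{verbatim}

In this example the $\gamma$ values of the two edges leaving the source sum to one, consistent with the interpretation of $\gamma_e$ as the probability that a random source-sink path passes through $e$.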

The Baum-Welch/EM optimization paradigm is iterative. The value of the graph is regarded as a function of the parameter set $\Psi^\prime$, and the goal is to maximize it. The existing parameter set is denoted $\Psi$ and is regarded as a fixed reference during optimization. Based on $\Psi$, each iteration begins with the computation of $\gamma$ values for each edge. Based on these $\gamma$ values, $k$ subproblems are then spawned, one for each weight function.


\begin{displaymath}
\left\{
\mbox{argmax}_{\Psi_i^\prime}
\sum_{e \in c^{-1}(i)}
\gamma_e
\log w_e(\Psi_e^\prime)
\right\}
, i = 1,\ldots,k
\end{displaymath} (1)

The mathematics of Baum-Welch/EM tell us that any progress in one or more of these subproblems will strictly increase the value of the graph. The value of a graph is a complex interaction of edge values, each dependent on a weight function and on the edge itself. The fact that the task of optimizing a graph's value can be decomposed so neatly is, in our view, the essence of the paradigm. All interactions are, in effect, sufficiently accounted for by the $\gamma$ computation. Commonly used weight functions include discrete probability functions and Gaussian densities. In both cases simple and intuitive closed-form solutions exist for these subproblems - contributing to the popularity of the paradigm.
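
For example, when the edges of a class share a single discrete distribution - as with HMM transition or symbol-output probabilities - the subproblem of equation (1) has the familiar closed form sketched below. The names (reestimate_discrete, outcome_of_edge) are ours, and we assume each edge's weight is simply the probability of its associated outcome.

\begin{verbatim}
from collections import defaultdict

def reestimate_discrete(gammas, outcome_of_edge):
    """Closed-form solution of one subproblem of equation (1) when the
    weight function of a class is a discrete distribution p and each
    edge e in the class has weight p[outcome_of_edge[e]]: maximizing
    sum_e gamma_e * log p[outcome_of_edge[e]] over probability vectors
    gives p[o] proportional to the gamma mass landing on outcome o."""
    totals = defaultdict(float)
    for e, g in gammas.items():
        totals[outcome_of_edge[e]] += g
    z = sum(totals.values())
    return {o: t / z for o, t in totals.items()}
\end{verbatim}

For an HMM this mirrors the usual Baum-Welch reestimation formulas: outcome_of_edge maps each edge of the class to the transition or output symbol it represents.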

Our paper extends the paradigm to rational settings consisting of a numerator and denominator graph, where $C_N(\cdot)$ and $C_D(\cdot)$ denote the corresponding edge class functions. Theorem 2 is our main result. Its equation 3 extends equation 1 to the rational case by adding a second term. Our extension cleanly addresses the issue of decomposition, but we know of no simple closed-form solutions for the subproblems that compare with those of Baum-Welch/EM. General functional maximization techniques may be used and specialized techniques represent an interesting area for future work.

There is a long history of developments related to our work. The analysis of [3] marks the generally accepted beginning of the Baum-Welch/EM paradigm and hidden Markov modeling. Essentially the same mathematical idea is found in [9] which introduces EM in the specific context of maximum likelihood mixture density estimation. For completeness we remark that this idea can more generally be viewed as a projection operation with respect to Kullback-Leibler distance [16,7] under Bregman's general framework for convex optimization [6]. This framework is closely related to the later work of I. Csiszár and G. Tusnády [8] and generalizes the much earlier work of [1,18] and also [10,5].

Interest in alternative estimation criteria has grown in recent years and [2] is a notable early milestone in this development. Viewing these more difficult problems in the context of polynomials [11] generalizes Baum-Welch/EM through an embedding that reduces the rational problem to a conventional one. Other work in the area includes the ECM algorithm [17], Hierarchical Mixtures of Experts [15], the CEM algorithm [13], and most recently [14]. Like [11] we consider the problem at an abstract level, but directly derive a different inequality and decomposition result for the rational case.

Jebara and Pentland in [13] (Equation 7) exploit the inequality $\log x \leq x - 1$ to obtain a decomposition result for simple mixture densities, where the rational form corresponds to the a posteriori likelihood of a labeled dataset. Note that throughout this paper natural logarithms are assumed. Our contribution is the application of this same inequality together with the inequality of lemma 3 to give an abstract and general decomposition result. Our result applies to all models we are aware of within the hidden Markov model class - to which mixture densities belong as a simple instance. For mixture densities, our decomposition reduces to theirs.

Our results are presented abstractly, rather than in the specific context of probability modeling. We suggest that the utility of this abstraction is i) to make clear that the result applies beyond conventional models to noncausal constructions or those with penalty functions, and ii) to provide a simple graph-oriented framework that makes it easier to reason about complex models. The origin of this outlook is [20,21] where the development is limited to the simple (nonrational) case.

Rational Decomposition

Lemma 1   Let $X=x_1,\ldots,x_r$ be positive values with sum $S$, and $X^\prime=x_1^\prime,\ldots,x_r^\prime$ be another vector of positive values. Let $P_i = x_i / S$. Then:


\begin{displaymath}
\log{\frac{x_1^\prime+\ldots+x_r^\prime}
{x_1+\ldots+x_r}}
=
\underbrace{
P_1 \left(\frac{x_1^\prime}{x_1}-1\right)
+ \ldots
+ P_r \left(\frac{x_r^\prime}{x_r}-1\right)
}_{\overline{Q}(X,X^\prime)}
+ N(X,X^\prime)
\end{displaymath} (2)

where $N(X,X^\prime) \leq 0$ and is equal to zero when $X = X^\prime$, and $\overline{Q}(X,X^\prime)$ is defined as the bracketed terms above.

proof: From the elementary inequality $\log y \leq y - 1$ with equality at $y=1$ we have that:


\begin{displaymath}
\log{\frac{x_1^\prime+\ldots+x_r^\prime}
{x_1+\ldots+x_r}}
=
\frac{x_1^\prime+\ldots+x_r^\prime}
{x_1+\ldots+x_r}
- 1 + N(X,X^\prime)
\end{displaymath}

and

\begin{eqnarray*}
\frac{x_1^\prime+\ldots+x_r^\prime}
{x_1+\ldots+x_r} - 1 & = &
P_1 \frac{x_1^\prime}{x_1} + \ldots + P_r \frac{x_r^\prime}{x_r}
- (P_1 + \ldots + P_r) \\
& = & P_1 \left(\frac{x_1^\prime}{x_1}-1\right) + \ldots
+ P_r \left(\frac{x_r^\prime}{x_r}-1\right)
\end{eqnarray*}



$\;\Box$

Lemma 2 (Baum et al.)   Given the setting of Lemma 1


\begin{displaymath}
\log{\frac{x_1^\prime+\ldots+x_r^\prime}
{x_1+\ldots+x_r}}
=
\underbrace{
P_1 \log{\frac{x_1^\prime}{x_1}}
+ \ldots
+ P_r \log{\frac{x_r^\prime}{x_r}}
}_{Q(X,X^\prime)}
+ P(X,X^\prime)
\end{displaymath}

where $P(X,X^\prime) \geq 0$ and is equal to zero when $X = X^\prime$, and $Q(X,X^\prime)$ is defined as the bracketed terms above.

proof: For completeness we recast here the well-known elegant proof originating with [4] into our framework. Let $S^\prime = \sum_{i=1}^r x^\prime_i$, and $P^\prime_i = x^\prime_i / S^\prime$. We must show that $\log{S^\prime/S} = Q(X,X^\prime)+P(X,X^\prime)$, satisfying the conditions of the lemma. Now:

\begin{eqnarray*}
\log{S^\prime/S} & = & \log{S^\prime} - \log{S} \\
& = & \sum_{i=1}^r P_i \log{\frac{x_i^\prime}{P_i^\prime}}
- \sum_{i=1}^r P_i \log{\frac{x_i}{P_i}} \\
& = & \underbrace{
\sum_{i=1}^r P_i \log{\frac{x_i^\prime}{x_i}}
}_{Q(X,X^\prime)}
+ \underbrace{
\sum_{i=1}^r P_i \log{(P_i/P_i^\prime)}
}_{D(P\Vert P^\prime)}
\end{eqnarray*}



where $D(P\Vert P^\prime)$ is the Kullback-Leibler distance [16,7] between stochastic vectors $P$ and $P^\prime$, which is always nonnegative and has value zero given equal arguments. $\;\Box$
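
The opposing bounds of lemmas 1 and 2 are easily checked numerically; a minimal sketch (the vectors chosen here are arbitrary):

\begin{verbatim}
import math

def Q_bar(X, Xp):      # the upper-bounding sum of lemma 1
    S = sum(X)
    return sum((x / S) * (xp / x - 1) for x, xp in zip(X, Xp))

def Q(X, Xp):          # the lower-bounding sum of lemma 2
    S = sum(X)
    return sum((x / S) * math.log(xp / x) for x, xp in zip(X, Xp))

X, Xp = [1.0, 2.0, 3.0], [1.5, 1.0, 4.0]
log_ratio = math.log(sum(Xp) / sum(X))
assert Q(X, Xp) <= log_ratio <= Q_bar(X, Xp)
print(Q(X, Xp), log_ratio, Q_bar(X, Xp))  # approx -0.020, 0.080, 0.083
\end{verbatim}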

The previous two lemmas give opposing bounds and are combined in the following theorem, which is the first step toward decomposition of our rational optimization problem.

Theorem 1   Let $x_1(\Psi^\prime),\ldots,x_r(\Psi^\prime)$ denote a set of positive scalar functions with domain consisting of values $\Psi^\prime$ from some parameter space. Let $y_1(\Psi^\prime),\ldots,y_s(\Psi^\prime)$ be similarly defined. Next let $\Psi$ denote any fixed reference set of parameter values. Finally define $P_{x_i} = x_i(\Psi) / \sum_{j=1}^r x_j(\Psi)$, and define $P_{y_i}$ similarly. Then:

\begin{eqnarray*}
\log{
\frac{x_1(\Psi^\prime)+\ldots+x_r(\Psi^\prime)}
{y_1(\Psi^\prime)+\ldots+y_s(\Psi^\prime)}
} & = & Q_X(\Psi,\Psi^\prime) - \overline{Q}_Y(\Psi,\Psi^\prime)
+ C_{XY}(\Psi) + B_{XY}(\Psi,\Psi^\prime)
\end{eqnarray*}



where $B_{XY}(\Psi,\Psi^\prime) \geq 0$, and is zero when $\Psi^\prime = \Psi$, and $C_{XY}(\Psi)$ is constant with respect to $\Psi^\prime$.

proof:

\begin{displaymath}
\log{
\frac{x_1(\Psi^\prime)+\ldots+x_r(\Psi^\prime)}
{y_1(\Psi^\prime)+\ldots+y_s(\Psi^\prime)}
}
=
\log{
\frac{x_1(\Psi^\prime)+\ldots+x_r(\Psi^\prime)}
{x_1(\Psi)+\ldots+x_r(\Psi)}
}
-
\log{
\frac{y_1(\Psi^\prime)+\ldots+y_s(\Psi^\prime)}
{y_1(\Psi)+\ldots+y_s(\Psi)}
} + C_{XY}(\Psi)
\end{displaymath}

where $C_{XY}(\Psi) = \log{[(x_1(\Psi)+\ldots+x_r(\Psi))/(y_1(\Psi)+\ldots+y_s(\Psi))]}$, and lemmas 1 and 2 apply to the first two terms giving:


\begin{displaymath}
\log{
\frac{x_1(\Psi^\prime)+\ldots+x_r(\Psi^\prime)}
{y_1(\Psi^\prime)+\ldots+y_s(\Psi^\prime)}
}
=
Q_X(\Psi,\Psi^\prime) - \overline{Q}_Y(\Psi,\Psi^\prime)
+ C_{XY}(\Psi)
+ \underbrace{P_X(\Psi,\Psi^\prime) - N_Y(\Psi,\Psi^\prime)}
_{B_{XY}(\Psi,\Psi^\prime)}
\end{displaymath}

$\;\Box$

Each term $x_i$ and $y_i$ of theorem 1 represents the value of a source-sink path within the DAG. This value is the product of edge contributions along it. To decompose the optimization problem our objective is to rewrite each such product as a sum of edge contributions. This happens naturally and exactly for the $Q$ function since it is expressed in terms of logarithms. But the $\overline{Q}$ function is more difficult, and we will give a sum decomposition that upper bounds the actual product value, with equality when all factors are equal.

Lemma 3   Given positive values $v_1, \ldots, v_t$, we have $\sum_{i=1}^t \frac{v_i^t}{t} \geq \prod_{i=1}^t v_i$.

proof: Since the arithmetic mean is always greater than or equal to the geometric mean we have: $\sum_{i=1}^t \frac{v_i^t}{t} \geq \sqrt[t]{\prod_{i=1}^t v_i^t} = \prod_{i=1}^t v_i$. $\;\Box$
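
A quick numeric illustration of the lemma (ours, with $t=4$ arbitrary positive values):

\begin{verbatim}
import math

v = [0.5, 1.2, 2.0, 0.9]
t = len(v)
assert sum(x ** t for x in v) / t >= math.prod(v)   # 4.698... >= 1.08
\end{verbatim}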

Theorem 2   (Main Result) Let $k$ denote the number of weight functions that occur in a given rational product-flow problem, and $\ell$ denote the length of the denominator graph's longest source-sink path. Then increasing the value of any of the $k$ separate subproblems below increases the value of the main problem.


\begin{displaymath}
\left\{
\mbox{argmax}_{\Psi_i^\prime}
\left[
\sum_{e \in C_N^{-1}(i)}
\gamma_e \log w_e(\Psi_e^\prime)
-
\sum_{e \in C_D^{-1}(i)}
\frac{\gamma_e}{\ell}
\left(
\frac{w_e(\Psi_e^\prime)}{w_e(\Psi_e)}
\right)^{\ell}
\right]
\right\}
, i = 1,\ldots,k
\end{displaymath} (3)

proof: We assume without loss of generality that all source-sink paths in the denominator DAG are of exactly length $\ell$. This is justified because an equivalent regularized DAG that has this property can easily be constructed by inserting new dummy vertices and dummy edges of constant unit weight as needed to stretch any short paths.

Next consider a source-sink path $i$ in the denominator graph and let $E_i$ denote the set of edges along it. Let the value of this path correspond to a single term $y_i$ of theorem 1. Then:

\begin{eqnarray*}
P_{y_i} \left(\frac{y_i(\Psi^\prime)}{y_i(\Psi)}-1\right)
& = & P_{y_i} \prod_{e \in E_i} \frac{w_e(\Psi_e^\prime)}{w_e(\Psi_e)} - P_{y_i} \\
& \leq & \sum_{e \in E_i} \frac{P_{y_i}}{\ell}
\left(\frac{w_e(\Psi_e^\prime)}{w_e(\Psi_e)}\right)^{\ell} - P_{y_i}
\hspace{2em} \mbox{(by lemma 3)}
\end{eqnarray*}



Now each edge $e$ in the denominator graph occurs in many source-sink paths. Summing the $P_{y_i}$ over all paths passing through edge $e$ yields $\gamma_e$. Then observing that the subtraction of $P_{y_i}$ does not affect the optimization yields the additive term:


\begin{displaymath}
\frac{\gamma_e}{\ell} \left(
\frac{w_e(\Psi_e^\prime)}{w_e(\Psi_e)}\right)^\ell
\end{displaymath}

for each edge. Applying lemma 2 to the numerator graph and summing over all paths containing an edge $e$ yields the additive term $\gamma_e \log w_e(\Psi^\prime_e)$. Subtracting the denominator term above from this and grouping by weight function completes the proof. $\;\Box$
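
To make the decomposition concrete, the sketch below evaluates the bracketed objective of equation (3) for a single weight-function class. The organization - edge-indexed dictionaries of $\gamma$ values for the two graphs, and a weight function taking (parameters, edge index) - is an assumption of ours about how one might arrange the computation, not a prescription from the theorem.

\begin{verbatim}
import math

def rational_subproblem_objective(w_i, psi_prime, psi,
                                  num_gammas, den_gammas, ell):
    """Bracketed objective of equation (3) for one weight-function
    class: the numerator term sums gamma_e * log w_e(psi') over the
    class's edges in the numerator graph; the denominator term sums
    (gamma_e / ell) * (w_e(psi') / w_e(psi))**ell over its edges in
    the denominator graph.  num_gammas and den_gammas map an edge
    index n(e) to its gamma value; w_i(params, edge_index) is the
    class's weight function."""
    num = sum(g * math.log(w_i(psi_prime, e))
              for e, g in num_gammas.items())
    den = sum((g / ell) * (w_i(psi_prime, e) / w_i(psi, e)) ** ell
              for e, g in den_gammas.items())
    return num - den
\end{verbatim}

In the absence of a closed form, this objective may be handed to any general-purpose maximizer over psi_prime; by the theorem, raising it above its value at psi_prime = psi increases the rational product flow.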

Discussion

Using Baum-Welch/EM, multiple observations are dealt with by adding their corresponding $Q$ functions. The same is true in our more general rational setting by adding the $\overline{Q}$ functions as well. The DAG depth $\ell$ is then related to the complexity of processing a single observation. For simple mixture models $\ell=2$ (see the explanation below), and for HMMs $\ell=2T$ where $T$ is the number of time-series elements comprising a single observation in the training set.

The DAG for a simple mixture density is easily described. Each of the edges outbound from the source selects a mixture component, and each edge's weight is the probability of that component. Each such edge then leads to an interior vertex with in-degree one. From that vertex a single outbound edge leads to the sink. Its weight is the observation's probability given the corresponding density function. Hence $\ell=2$ for mixture densities since all source-sink paths are of this length.
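
A sketch of this construction for a univariate Gaussian mixture, reusing the toy representation and the alpha_values helper from the introduction (the Gaussian component densities are our illustrative choice):

\begin{verbatim}
import math

def mixture_dag(coeffs, means, variances, x):
    """Two-level DAG for a univariate Gaussian mixture and a single
    observation x: source 0, one interior vertex per component, sink
    last.  First-layer weights are the mixture coefficients; second-
    layer weights are the component densities at x.  Every
    source-sink path has length 2, so ell = 2."""
    k, sink = len(coeffs), len(coeffs) + 1
    edges = []
    for j in range(k):
        density = math.exp(-(x - means[j]) ** 2 / (2.0 * variances[j])) \
                  / math.sqrt(2.0 * math.pi * variances[j])
        edges.append((0, 1 + j, coeffs[j]))      # choose component j
        edges.append((1 + j, sink, density))     # emit observation x
    return sink + 1, edges

n_mix, mix_edges = mixture_dag([0.3, 0.7], [0.0, 2.0], [1.0, 1.0], 1.0)
print(alpha_values(n_mix, mix_edges)[n_mix - 1])  # mixture likelihood
\end{verbatim}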

So our result gives a new form of CEM [13] where both mixture coefficients and density models are optimized in a single step. If they are trained one-by-one, then $\ell=1$ and our decomposition is equivalent to Equation 7 of [13].

Returning to more complex models such as HMMs, we remark that if a single observation is made up of a long series of elements, then $\ell$ can become prohibitively large and reduce the step size that is possible using our results. There are two general strategies that might be used to combat this effect: 1) break up a single iteration into subiterations that optimize weight functions one-by-one ($\ell=1$) or in groups, in both cases recomputing $\gamma$ values between subiterations; 2) develop an adaptive algorithm that searches for a reduced $\ell$ value that nevertheless allows the optimization to make progress.
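
A control-flow sketch of the first strategy, with placeholder callbacks compute_gammas and optimize_class standing in for the machinery developed above:

\begin{verbatim}
def subiteration_pass(classes, psi, compute_gammas, optimize_class):
    """Strategy (1): optimize one weight-function class at a time,
    recomputing gamma values (alpha/beta passes on both graphs)
    before each subproblem so the reference parameters stay current."""
    for i in classes:
        num_gammas, den_gammas = compute_gammas(psi)
        psi = optimize_class(i, psi, num_gammas, den_gammas)
    return psi
\end{verbatim}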

At our level of abstraction little can be said regarding convergence. We do know that strictly greater product-flows will result given strict progress in any of the subproblems, so that if the flow value is bounded (as it is for any probability model) then convergence of this value is ensured. But this does not imply convergence of model parameters. Saying more about this requires that something more be assumed about the weight functions themselves, and is beyond the scope of this paper.

It must be said that by extending to the rational case we leave behind a splendid property of conventional Baum-Welch/EM. Namely, that for common weight functions such as normal densities or discrete probability functions, the subproblems resulting from decomposition have globally optimal solutions that are trivially computed in closed form. Still, efficient approaches do sometimes exist. In [13] the authors consider simple mixtures of unnormalized Gaussians, and do not attempt to train all parameters simultaneously. The result is a practical prescription for this setting. General techniques for subproblem optimization are an interesting area for future work, but are beyond the scope of this paper.

Still, our work cleanly settles the issue of decomposition at a level of generality that includes all discrete-state models that the author is aware of within the hidden Markov family. In addition we suggest that our graph-based product-flow outlook makes it easier to reason about complex models and see, for example, that the paradigm's decomposition mathematics apply directly to variations including noncausal models, and penalized constructions that impose soft parsimony constraints such as MDL and MAP.

Acknowledgments

The author thanks Dan Gindikin and Sumeet Sobti for their comments on this manuscript, and Eric Sven Ristad with whom the original DAG-based outlook for simple product flows was developed.

Bibliography

1
AGMON, S.
The relaxation method for linear inequalities.
Canadian Journal of Mathematics 6, 3 (1954), 382-392.

2
BAHL, L. R., BROWN, P. F., DE SOUZA, P. V., AND MERCER, R. L.
Maximum mutual information estimation of hidden Markov model parameters for speech recognition.
In Proc. ICASSP-86 (1986), pp. 49-52.

3
BAUM, L. E., AND EAGON, J. E.
An inequality with application to statistical estimation for probabilistic functions of a Markov process and to models for ecology.
Bull. AMS 73 (1967), 360-363.

4
BAUM, L. E., PETRIE, T., SOULES, G., AND WEISS, N.
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains.
Ann. Math Stat. 41 (1970), 164-171.

5
BREGMAN, L. M.
Finding the common point of convex sets by the method of successive projections.
Dokl. Akad. Nauk SSSR 162 (1965), 487-490.

6
BREGMAN, L. M.
The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming.
Zh. vychisl. Mat. mat. Fiz. 7, 3 (1967).

7
COVER, T. M., AND THOMAS, J. A.
Elements of Information Theory.
Wiley, 1991.

8
CSISZÁR, I., AND TUSNÁDY, G.
Information geometry and alternating minimization procedures.
Statistics & Decisions Supplement Issue No. 1 (1984), 205-237.
R. Oldenbourg Verlag, München 1984.

9
DEMPSTER, A. P., LAIRD, N. M., AND RUBIN, D. B.
Maximum likelihood from incomplete data via the EM algorithm.
J. Royal Statistical Society Ser. B (methodological) 39 (1977), 1-38.

10
EREMIN, I. I.
A generalization of the Motzkin-Agmon relaxational method.
Usp. Mat. Nauk 20 (1965), 183-187.

11
GOPALAKRISHNAN, P. S., KANEVSKY, D., NÁDAS, A., AND NAHAMOO, D.
An inequality for rational functions with applications to some statistical estimation problems.
IEEE Trans. Information Theory 37 (1991), 107-113.

12
HUANG, X. D., ARIKI, Y., AND JACK, M. A.
Hidden Markov Models for Speech Recognition.
Edinburgh University Press, 1990.

13
JEBARA, T., AND PENTLAND, A.
Maximum conditional likelihood via bound maximization and the CEM algorithm.
In Advances in Neural Information Processing Systems (1998), vol. 11, The MIT Press.

14
JEBARA, T., AND PENTLAND, A.
On reversing Jensen's inequality.
In Advances in Neural Information Processing Systems (2000), vol. 13, The MIT Press.

15
JORDAN, M., AND JACOBS, R.
Hierarchical mixtures of experts and the EM algorithm.
Neural Computation (1994).

16
KULLBACK, S., AND LEIBLER, R. A.
On information and sufficiency.
Annals of Mathematical Statistics 22 (1951), 79-86.

17
MENG, X., AND RUBIN, D.
Maximum likelihood estimation via the ECM algorithm: A general framework.
Biometrika 80, 2 (1993).

18
MOTZKIN, T. S., AND SCHOENBERG, I. I.
The relaxation method for linear inequalities.
Canadian Journal of Mathematics 6, 3 (1954), 393-404.

19
PORITZ, A. B.
Hidden Markov models: a guided tour.
In Proc. ICASSP-88 (1988), pp. 7-13.

20
RISTAD, E. S., AND YIANILOS, P. N.
Finite growth models.
Tech. Rep. 533-96, Princeton University, Department of Computer Science, 1996.

21
YIANILOS, P. N.
Topics in Computational Hidden State Modeling.
PhD thesis, Princeton University, Department of Computer Science, June 1997.
