Finite Growth Models

Finite growth models (FGM) are nonnegative functionals that arise from parametrically-weighted directed acyclic graphs and a tuple observation that affects these weights. The weight of a source-sink path is the product of the weights along it. The functional's value is the sum of the weights of all such paths. The mathematical foundations of hidden Markov modeling (HMM) and expectation maximization (EM) are generalized to address the problem of functional maximization given an observation. Probability models such as HMMs and stochastic context free grammars are examples that satisfy a particular constraint: that of summing or integrating to one. The FGM framework, algorithms, and data structures describe these and other similar stochastic models while providing a unified and natural way for computer scientists to learn and reason about them and their many variations.

Restricted to probabilistic form, FGMs correspond to stochastic automata that allow observations to be processed in many orderings and groupings - not just one-by-one in sequential order. As a result the parameters of a highly general form of stochastic transducer can be learned from examples, and the particular case of string edit distance is developed.

In the FGM framework one may direct learning by criteria beyond simple maximum-likelihood. The MAP (maximum a posteriori probability estimate) and MDL (minimum description length) are discussed along with the application of FGMs to causal-context probability models and unnormalized noncausal models.

Introduction

Hidden discrete-state stochastic approaches such as hidden Markov models (HMM) for time series analysis, stochastic context free grammars (SCFG) for natural language, and statistical mixture densities are used in many areas including speech and signal processing, pattern recognition, computational linguistics, and more recently computational biology. These are parametric models and are typically optimized using the Baum-Welch, inside-outside, and expectation maximization (EM) algorithms respectively. They share a common mathematical foundation and are shown to be instances of a single more general abstract recursive optimization paradigm which we refer to as the finite growth model framework (FGM) involving non-negative bounded functionals associated with finite directed acyclic graphs (DAG). These functionals need not be probabilities and an FGM need not correspond to a stochastic process.

With FGMs interesting new kinds of stochastic models may be designed, existing models may be improved, settings that are not strictly probabalistic in nature may be addressed, and it becomes easier to design and reason about a wide range of problems. This chapter^2.1introduces and develops the FGM framework and then applies it to:

Graphical models [Pea88,SHJ96] represent a branch of related work in which the independence structure of a set of random variables is the main focus. One can express such models using FGMs - as well as considerably more general settings such as that in which no two variables are independent (their graphical model is fully connected), but they are jointly constrained in interesting ways.

An FGM is a directed acyclic graph with a single source and sink together with a collection of non-negative bounded weight functions $\{w_i\}$ that are associated with edges - one per edge. The value of a source-sink path is the product of the weights along it and the FGM's value is the sum over all such paths of their values. Section 2.2 presents a formal development but we will sketch the basic ideas here.

The weight function $w_i(z_e\vert\Psi_i)$ associated with edge

accepts an argument

and parameters $\Psi_i$ . The objective of FGM optimization is to maximize the FGM's value over its parameters $\{\Psi_i\}$ which are assumed to be independent. The argument

is regarded as fixed. The same weight function may be associated with several edges but will in general assume different values since

is a function of the edge to which it is attached.

If each edge has a distinct associated weight function then optimization is immediately reduced to the task of independently maximizing the weight functions in isolation. The interesting situation arises when this is not the case. Here a change to a single parameter set $\Psi$ may affect weights throughout the DAG. The values of many source-sink paths may be affected and an interaction between members of $\{\Psi\}$ arises, i.e. simple independent maximization is no longer possible.

Surprisingly it is still possible to decompose the optimization problem into certain independent subproblems. This is the key mathematical message of Baum-Welch and EM generalized to the FGM framework. However, even exact solution of the subproblems does not yield a maximum for the FGM. But a new combined parameter set does result that strictly improves it -- unless one has already converged to a limiting value. Iteration is then used to climb to such a limiting value. Unlike gradient descent with line search this approach confidently moves to a sequence of increasingly better points in parameter space and in many cases reduces to a strikingly simple and intuitive algorithm. The cost functions may themselves be FGMs and the computation of an improved parameter set simply depends upon the ability to solve the primitive maximization problems at the bottom of the recursive hierarchy.^2.2

The mathematical idea behind this decomposition begins with a simple equality having to do with the $\log$ of a sum of terms. Let $S = T_1 + \ldots + T_n$ and define proportions

. Notice

. For any nonnegative vector

with unity sum:

Observe that the second term is maximized when

since both are stochastic vectors.

In our setting each term

corresponds to a source-sink path, and its value is not constant, but rather is a function of some parameter set ${\Psi^\prime}$ . Optimizing the FGM is then the problem of maximizing their sum over ${\Psi^\prime}$ . This is the same as optimizing their $\log$ sum; connecting the equality above to the problem of FGM optimization.

Accordingly we parameterize the equality by writing $S({\Psi^\prime}) = T_1({\Psi^\prime}) + \ldots + T_n({\Psi^\prime})$ and $P({\Psi^\prime}) = T_i({\Psi^\prime}) / S({\Psi^\prime})$ where our objective is to maximize

over ${\Psi^\prime}$ . The search for a maximum takes place iteratively. During each iteration we denote the current parameters by $\Psi$ and vary ${\Psi^\prime}$ looking for an improved set.

To express this iterative approach the equality is specialized by defining

to be

evaluated at the current parameters $\Psi$ , i.e. $R \triangleq P(\Psi)$ . Then:

The right term may be ignored because any departure from $\Psi$ will decrease it whereby $\log S$ and therefore

are increased. We then focus on maximizing the left term which corresponds to the

function of the HMM/EM literature.

Elements of the remaining summation correspond to source-sink paths through the FGM's DAG and the proportions $P_i(\Psi)$ are the relative contribution of each path to the FGM's value. Each term $T_i({\Psi^\prime})$ is then a product of of weight functions and breaks up into a sum of individual $\log$ terms. The proportions are then distributed over them and the final step is to group terms by weight function. The problem is then decomposed as required into independent subproblems each involving a single weight function.

The argument above is made in much greater detail in section 2.2 and the formal definition of an FGM presented there is more involved in two respects. First, there are two weight functions, not one, associated with each edge. These are abstractions of the normally separate transition and observation generation functions of a stochastic automaton. Second, the

in our discussion above is defined to be some function of an observation tuple

. A simple such function is that of coordinate projection where, for example, the function's value might be that of a single dimension of

The contribution of section 2.2 is the development of a framework that has nothing directly to do with stochastic models or probability in which the essential ideas of Baum-Welch and EM nevertheless apply. Our view is that the essential content of these ideas is one of decomposability of certain optimization problems that are captured in a somewhat general way by our definition of an FGM.

The early literature, especially theorem 2.1 of [BPSW70], reveals an awareness that the mathematics of HMMs apply beyond strictly probabalistic settings. This direction of generalization does not however appear to have been followed until now. Thinking seems to have been dominated by the motivating problem: the optimization of Markovian stochastic models. Because of the limited computational capabilities of the time and their highly mathematical viewpoint a great deal of emphasis was also placed on the primitive weight functions for which exact maximization may be performed in closed form, and those which give rise to unique global maxima. By contrast we devote very little attention to the the specific characteristics of these primitive functions and view them instead as an abstract types that by assumption support a set of required operations.

In section 2.3 the FGM framework is specialized to the case of probabilistic models. The

above result from projections that select one or more coordinate dimensions such that along each source-sink path each dimension is selected exactly once. Also, the weight functions are restricted to be probability functions or densities. It is not important that the dimensions occur in any fixed order as one follows alternative source-sink routes; nor is it required that the dimensions be selected one at a time. The contribution of this outlook is that FGM-based stochastic models can generate observations in many orderings and groupings - not just one-by-one in sequential order as in HMMs - a capability exploited later in our discussion of stochastic transducers.

Section 2.4 begins by considering the case of infinite stochastic processes. Viewed generatively these models produce output of unbounded size but are used in practice to evaluate the probability of finite observations and are trained using a finite set of finite observations. Restricted in this way they correspond to FGMs that capture their truncated form. Moreover the conventional evaluation and learning algorithms associated with them arise naturally from this viewpoint.

Baum-Welch and EM are today generally regarded as maximum-likelihood parameter estimation techniques. A more modern learning theoretic perspective recognizes the important tension between model complexity and available training data. An important contribution of section 2.4 is the description of a construction that modifies the FGM associated with a stochastic model so that alternative criteria such as MDL or MAP can guide optimization. The optimization is still based on Baum-Welch/EM so we suggest that these should not be viewed as statistical ML techniques and that their broader utility is revealed by our more general functional optimization perspective. The applicability of EM to MAP was first described in [DLR77]. Nevertheless, the parameters of models such as HMMs are still estimated in the ML sense.

The observation functions associated with each state in a conventional Hidden Markov model must satisfy an independence assumption: that their state dependency is limited to the current state (the Markov condition). Section 2.4 observes that conditioning on earlier portions of the observation sequence does not violate this assumption and demonstrates that using the FGM framework, it is easy to see that parameter reestimation for such context dependent HMMs decomposes into primitive context optimization problems. This generalizes Brown's use [Bro87] of the previous speech window to condition generation of the current one. Thus the power of causal context models such as statistical language models may be added to the HMM framework and may lead to improved performance.

The section also presents another illustration that the reach of FGMs extends in a useful way beyond conventional and strictly stochastic models. Causal context models express the probability of an observation as a product of conditional probabilities such that each dimension is predicted once and does appear earlier as a conditioning variable. These restrictions are constrictive in settings such as image modeling where modeling each pixel based on its surrounding context seems natural. Simply relaxing the requirements that probabalistic modeling places on an FGM's source-sink projection functions proves that FGM optimization will improve even noncausal models like the image example above.

Stochastic models describe the manner in which an object is generated. This object is conventionally imagined to be something like an image, an audio recording, or a string of symbols. If instead one models pairs of such objects the focus shifts from describing the nature of an individual to uncovering the relationship between them. The pairs may be thought of as instances of the are similar concept and learning an effective model to describe them is an example of what the authors have called the metric learning paradigm.

Fixing the left element of the pair one might then ask what generative sequences could explain the right. This is the essential idea of

-way stochastic transduction. Section 2.5 shows that a somewhat general automaton-based definition for

-way transduction may be reduced to problems within the FGM framework. One may view speech recognition as a transduction, perhaps in several stages, and the application of learned transducers to this field represents an interesting area for future work.

Perhaps the best known example of

-way transduction is that of string edit distance which is defined as the least costly way to transform one string into another via a given set of weighted insertion, deletion, and substitution editing operations. The development in section 2.5.1 begins with a natural reformulation into stochastic terms first noted by Hall and Dowling in [HD80, P.390-1]. We refer to the result as stochastic edit distance [RY96b].

Our focus is on the interesting problem of learning insert, delete, and substitute costs from a training corpus $(s_1,t_1), \ldots, (s_n,t_n)$ of string pairs. The objective is to minimize the total stochastic edit distance of the collection. In [RY96b] the authors give such a learning algorithm independent of the FGM framework and report on experiments. Until now the costs which parameterize edit distance algorithms have been prescribed not learned. Section 2.5.1 of this chapter describes the reduction of this learning problem to the FGM optimization framework and may be viewed as an adjunct to [RY96b] providing a proof of correctness.

The optimized costs are often quite different from the initial ones. It is in this sense that we learn them. In previous work, the costs have been stipulated after study of the problem domain. Even worse, maximally uninformative unit costs (i.e., Levenshtein distance) are sometimes stipulated when the problem domain is nontrivial. It is now possible to improve any educated guess, or start virtually from scratch from unit costs, to arrive at a set of optimized ones. Algorithms 7 and 8 of section 2.5.1 run in time proportional to the product of the two string lengths, matching the complexity of the conventional edit distance computation. Since edit distance has found widespread application our discovery of a simple way to learn costs may lead to improved performance in many problem areas.

Section 2.5.1 also observes that the notion of string edit distance may be strengthened so that the cost of an edit operation depends not only on its target but also on context. For example, such context-sensitive string edit distances allow the cost of inserting a symbol to depend on the symbol left-adjacent to the insertion site. The result remains an FGM on one therefore the ability to easily improve costs given a training corpus is retained.

Several kinds of hidden discrete-state stochastic models are in widespread use today. Hidden Markov Models (HMM) were introduced in the 1960's [BE67,BPSW70,Bau72] and are typically applied in time-series settings when it is reasonable to assume that the observed data are generated or approximated well by the output of an automaton which changes state and generates output values stochastically. A matrix-based approach is typically used to formalize this notion, and [Por88] and [HAJ90] provide nice overviews. Normal (Gaussian) mixtures find application in statistical pattern recognition where items to be classified are represented as real-valued vectors. Stochastic context free grammars (SCFG) lead to a natural probabilistic interpretation of natural language parsing. Other model types exist, and are easily formed by combining and modifying the standard ones.

The situation is complicated by parameter tying; the practice of reducing the number of parameters in a model by equating two or more different sets of formal parameters to a single set. Observations are sometimes discrete and sometimes continuous. Entire models may be mixed together or recursively combined. Some states or transitions may be nonemitting meaning that no observation is generated. Finally, nonuniform time-series models may predict several future data points. All of these factors make it difficult to write down a single high-level framework with which to reason about and verify the correctness of a model's associated algorithms.

The contribution of FGMs is to move instead to a lower-level representation which is common to all of the models and complicating variations above. As noted earlier isn't really the original model which is represented, but rather its truncation to the given collection of finite observations. By understanding FGMs it is possible to reason about the others as mappings from their individual description language to an FGM given finite observations.

Section 2.6 relates the models above to FGMs by describing a reduction for each. As further illustration it considers the portfolio optimization problem and shows that the simple iterative algorithm reported in [Cov84] for optimization arises by reducing the problem to an FGM.

The work described in this chapter began in 1993 with our efforts to model handwriting. The focus then was on transduction, in an attempt to evaluate and learn a notion for similarity for the off-line case. This effort exposed many limitations of the conventional modeling outlook - problems that extend far beyond handwriting. To address these limitations we had to break several rules of conventional modeling only to ultimately discover that these rules are in many cases unnecessary artifacts of the history that led to HMMs and EM. After three years of generalization, and exercises in specialization, we arrived at the FGM outlook reported here.

Finite Growth Models

In this section we develop the theory of Finite Growth Models (FGM) as a class of nonnegative functionals represented as annotated directed acyclic graphs. The next section focuses on the special case of probability models. Our formal definition of an FGM amounts to an intricate recursive data structure, in the computer science sense, that has been crafted to span a wide range of applications. As a result it may seem at first to be rather abstract and have little to do with stochastic modeling, but each feature is an abstraction of some characteristic required by a problem we considered. It is the mathematical engine at the center of these problems. In particular much of our development and notation parallels that of Hidden Markov Models (HMM).

Let

be a tuple observation with components $x_1,\ldots,x_d$ , and ${\cal X}$ denote the space of all such observations. We will not specify the type of each component since for much of our development it matters only that appropriate functions exists on the space of values that a component may assume. Components may therefore represent discrete values such as letters of the alphabet, continuous values such as real numbers, or other more complex structures. Within the observation tuple, component types need not be the same. In simple time series probability-model settings, components do have the same type, and position within the tuple corresponds to the passage of time.

Definition 1 A finite growth model

for a

-dimensional observation space ${\cal X}$ , is an 8-tuple

, such that:

is a finite DAG with vertices $V=v_1,\ldots,v_n$ having a single source and sink . For $e \in E$ we denote by the index of its source vertex, and by the index of its destination vertex.
is a finite collection $\{M_i\}$ of parameterized nonnegative bounded observation functionals with corresponding parameters $\{\Psi_i\}$ . We write $M_i(y\vert\Psi_i)$ to denote evaluation of the th observation functional, where the domain is not yet specified.
is a finite collection $\{C_i\}$ of discrete-domain parameterized nonnegative bounded choice functionals. The size of the domain of is denoted $\vert C_i\vert$ , and the domain consists of the integers $1,\ldots,\vert C_i\vert$ . The parameters of are denoted $\Upsilon_i$ and we write $C_i(j\vert\Upsilon_i)$ to denote evaluation of the th choice functional.
is a finite collection $\{S_i\}$ of selection functions with domain where the range is not yet specified.
$m: E \rightarrow \{1,\ldots,\vert M\vert\}$ is a function associating an element of with each edge.
is a function that associates each edge with a choice function and a member of its range; which is formalized by denoting the projections of by and requiring:
1. $s(e_1) = s(e_2) \Rightarrow c_1(e_1) = c_1(e_2), \forall e_1,e_2 \in E$ , so that an element of is in effect associated with each nonsink vertex, and;
2. For each nonsink vertex , restricted to $\{e \vert s(e)=v\}$ is 1-1 and onto the range of $C_{c_1(v)}$ .
The choice value corresponding to an edge is then $C_{c_1(e)}(c_2(e)\vert\Upsilon_{c_1(e)})$ and is denoted $C(e\vert\Upsilon)$ for brevity where it is understood that in this context $\Upsilon$ refers to $\Upsilon_{c_1(e)}$ - but the subscript is sometimes written for clarity.
$\varsigma: E \rightarrow \{1,\ldots,\vert S\vert\}$ is an function that associates each edge with a selection function such that for all edges , the range of $S_{\varsigma(e)}$ matches the domain of $M_{m(e)}$ . The observation value corresponding to an edge is then $M_{m(e)}(S_{\varsigma(e)}(x)\vert\Psi_{m(e)})$ and is denoted $M(e,x\vert\Psi)$ for brevity where it is understood that in this context $\Psi$ refers to $\Psi_{m(e)}$ - but the subscript is sometimes written for clarity.

The combined observation and choice parameters are denoted $\Psi$ and $\Upsilon$ respectively, and $\Phi \triangleq (\Psi,\Upsilon)$ denotes the parameters of . The value of a source-sink path is the product of the observation and choice functional values along it. The value $F(x\vert\Phi)$ of the FGM is the sum over all source-sink paths, of the value of each. A choice or observation model whose value is the constant does not affect an FGM's value and is therefore said to be void.

Notice that since an FGM is itself a nonnegative parameterized functional, the definition above may be applied recursively to define an FGM whose constituent functionals are themselves FGMs.

Direct evaluation of an FGM's value by enumerating all source-sink paths is of course not practical. Fortunately $F(x\vert\Phi)$ may be efficiently computed using dynamic programming. To express this precisely requires some notation. Let $v_1,\ldots,v_n$ be a topologically sorted vertex listing. Following the convention of the HMM literature, we define corresponding real variables $\alpha_1,\ldots,\alpha_n$ . The sets of edges into and out from a vertex

are denoted by ${\cal I}(v)$ and ${\cal O}(v)$ respectively. Using the notation above, the contribution of an edge

to the value of a path containing it, is $C(e\vert\Upsilon) \cdot M(e,x\vert\Psi)$ The product of these contributions is the probability of the path. The complete dynamic program is then given by the following simple algorithm:

This may be recognized as the forward step of the Baum-Welch HMM forward-backward algorithm [BE67,Bau72,Por88] adapted to the more general FGM setting. Following execution $\alpha_n$ has value $F(x\vert\Phi)$ .

Algorithm 1 is said to evaluate the FGM using its current set of parameters. The optimization problem we consider seeks a parameter set that maximizes the FGM's value

but we will be content to find in some sense locally optimal solutions, and in some cases to simply make progress. Here

is fixed and it is the FGM's parameters that are varied.

The end-to-end concatenation of two FGMs clearly results in another whose value is the product of the two, so that given a set of observations $x_1,\ldots,x_m$ , the optimization problem

is really just an instance of Eq. 4. Viewing $\{x_i\}$ as training examples the optimization then learns a parameter set, or in the language of statistics and assuming the FGM corresponds to a probability model, estimates its parameters.

Our approach to this problem extends the idea of Baum-Welch to FGMs. Several years following the development of HMMs, essentially the same mathematical ideas expressed in the their algorithm were rediscovered as the expectation maximization (EM) algorithm for mixture densities [DLR77,RW84]. This work focused on parameter estimation for mixture densities, and has had considerable influence on a somewhat independent community. In what follows we will use the phrase Baum-Welch/EM to refer to the adaptation of these results to the FGM setting.

The HMM and EM literature deals with probabalistic models, and Baum-Welch/EM is presented as a method for reestimating their parameters. Our mathematical contribution is the recognition that the method is not essentially a statement about probability but rather should be viewed as an iterative approach to the maximization of arbitrary parameterized nonnegative functionals of a certain form. The form consists of a sum of terms each one of which is a product of subterms. An FGM's DAG is a representation of these forms which, for many with appropriate structure, has the advantages of succinct representation and computational economy. In the interesting case, the subterms are partitioned or grouped into families sharing common parameters. The approach then consists of focusing on each family in isolation thus localizing the problem. Individual family maximization problems are constructed that include certain weighting factors. These factors are constant with respect to the family's optimization but are computed using the entire DAG and are therefore influenced by all families. Thus global information is used to decompose the original problem - localizing it to individual families. If parameters are found that improve any of the individual family problems, the FGM's value will strictly increase. The method is iterative since even exact maximization of every family's problem may not maximize the FGM's value. Instead the process is repeated.

To formalize this idea of simply making progress with respect to a given maximization problem, the notation:

is introduced. Notice that the improvement need not be strict and that the result is a set of values.

Our definition of an FGM is clearly an abstraction of stochastic modeling. We remark that it is possible to start from a definition that is yet more general and in a sense simpler in which only choice models are used and a variable number may be attached to each edge. But then connecting our results to the mainly stochastic problems we are interested in becomes more complex, and it is more difficult to see the connection between our generalization and the original literature.

We therefore develop Baum-Welch/EM for FGMs starting from a lemma that is stated so as to illustrate its correspondence with the EM literature. It is a statement about improper mixture densities, i.e. where the whole and its constituent parts are not necessarily probabilities. This is relevant to FGMs because they may be regarded as immense mixtures of contributions from every source-sink path. We follow earlier lines in its proof without requiring proper densities. The lemma may also be derived starting from theorem 2.1 of [BPSW70] in its most general form.

Lemma 1 (Baum-Welch/EM) Let $f(x\vert\Phi) \triangleq \sum_{i=1}^k g(x\vert\omega_i,\Psi) h(\omega_i\vert\Upsilon)$ be a finite parameterized mixture where $\{g(\cdot\vert\omega_i)\}$ and

are nonnegative parameterized functionals (but not necessarily probability functions or pdfs), $\omega_i$ selects a component of the mixture, and $\Phi \triangleq \{\Psi, \Upsilon\}$ . Also define probability $P(\omega_i\vert x,\Phi) \triangleq g(x\vert\omega_i,\Psi) h(\omega_i\vert\Upsilon) / f(x\vert\Phi)$ . Finally, assume $f(x\vert\Phi) > 0$ . Then:

$\begin{displaymath} \overline{\Phi}\in \mathrm{arg climb}_{\Psi^\prime, \Upsilo... ...ga_i\vert x,\Phi)\log h(\omega_i\vert\Upsilon^\prime) \right) \end{displaymath}$

is an improved parameter set. That is, $f(x\vert\overline{\Phi}) \geq f(x\vert\Phi)$ , with the inequality strict unless $\Phi$ is already optimal. It is understood that $\Psi^\prime, \Upsilon^\prime$ vary over their defined domain.

where $\omega$ is viewed as a discrete random variable distributed as $P(\omega_i\vert x,\Phi)$ and $E_{\omega\vert x,\Phi}$ denotes expectation with respect to it. The point is merely that $f(x\vert{\Phi^\prime})$ is independent of $\omega$ since the former depends on ${\Phi^\prime}$ not $\Phi$ .

Next with $f(x,\omega_i\vert{\Phi^\prime}) \triangleq g(x\vert\omega_i,{\Psi^\prime}) h(\omega_i\vert{\Upsilon^\prime})$ we have $\log f(x\vert{\Phi^\prime}) = \log f(x,\omega\vert{\Phi^\prime}) - \log P(\omega\vert x,{\Phi^\prime})$ since $f(x,\omega\vert{\Phi^\prime}) = P(\omega\vert x,{\Phi^\prime}) f(x\vert{\Phi^\prime})$ . So:

It follows from Jensen's inequality that the second summation is minimized by $\Phi = {\Phi^\prime}$ . This can also be seen by recognizing it as $-[D(\Phi\Vert{\Phi^\prime}) + H(\Phi)]$ , where $D(\cdot\Vert\cdot)$ is the Kullbak-Leibler distance, and $H(\cdot)$ denotes entropy (see [CT91]). Thus maximizing the first term, written $Q(\Phi,{\Phi^\prime})$ in the literature, will surely increase $\log f(x\vert{\Phi^\prime})$ and hence $f(x\vert{\Phi^\prime})$ . But:

and the right side is recognized as the object of the $\mathrm{arg climb}$ operator in the lemma's statement. $\;\Box$

The conditional expectation operator $E_{\omega\vert x,\Phi}$ can cause confusion but we use it to make clear the correspondence with the EM literature. Our discussion in section 2.1 starting with Eq. 1 represents an alternative approach that is free of explicit mention of expectation and probability.

To apply lemma 1 to FGM optimization we introduce two more simple algorithms. The first computes $F(x\vert\Phi)$ by sweeping through the DAG's vertices in reverse rather than forward order, establishing the values of a new set of real variables $\beta_1,\ldots,\beta_n$ :

This corresponds to the backward step of the forward-backward algorithm. After it has run, $\beta_1$ equals $F(x\vert\Phi)$ , and therefore $\alpha_n$ .

We denote by $\omega_1,\ldots,\omega_N$ the source-sink paths through

, and by $E(\omega)$ the set of edges along a particular path $\omega$ , and by $\Omega(e)$ the set of paths that include edge

. The contribution of a single path $\omega$ to the overall value of the FGM is denoted $F(x,\omega\vert\Phi)$ and given by:

Next define $\gamma_\omega \triangleq F(x,\omega\vert\Phi) / F(x\vert\Phi)$ , and then for each edge

define:

Notice that by definition the $\gamma_e$ arise from a partition of the set of all source-sink paths whence it is clear that:

The $\alpha$ and $\beta$ variables may be used to compute these $\gamma_e$ via a third simple dynamic program corresponding to the expectation step of EM:

where either $\alpha_n$ or $\beta_1$ may be substituted for $F(x\vert\Phi)$ . As a practical matter algorithms 3 and 2 may be combined.

We are now in position to state precisely how the objective of optimizing an FGM is decomposed into smaller problems.

Definition 2 An inexact Baum-Welch/EM reestimate $\overline{\Phi}= (\overline{\Psi},\overline{\Upsilon})$ of the current parameters $\Phi=(\Psi,\Upsilon)$ of an FGM

given an observation

is formed by solving two sets of independent optimization problems:

For each the improved parameters $\overline{\Psi}_i$ of observation model are given by:

$\begin{displaymath} \overline{\Psi}_i \in \mathrm{arg climb}_{{\Psi^\prime}_i\V... ...in m^{-1}(i)} \gamma_e \cdot \log M(e,x\vert{\Psi^\prime}_i) \end{displaymath}$
For each the improved parameters $\overline{\Upsilon}_i$ of choice model are given by:

$\begin{displaymath} \overline{\Upsilon}_i \in \mathrm{arg climb}_{{\Upsilon^\pr... ..._1^{-1}(i)} \gamma_e \cdot \log C(e\vert{\Upsilon^\prime}_i) \end{displaymath}$

where the $\gamma_e$ are computed based on $\Phi=(\Psi,\Upsilon)$ . The exact Baum-Welch/EM reestimate is formed by replacing the $\mathrm{arg climb}$ operations with $\mathrm{arg max}$ . By $m^{-1}(\cdot)$ and $c^{-1}(\cdot)$ we mean the set valued inverses of functions and .

The exact form corresponds to the historical definition of the Baum-Welch/EM reestimate. However, since this chapter's focus is on decomposition, and not mathematical issues relating to particular choice and observation models, we will use ``Baum-Welch reestimate'' to refer to our inexact form defined above. This allows our framework to include models for which exact solution is not convenient. Also it is interesting to note that one may choose to not reestimate one or more models, and still improve the overall FGM.

Remark: this theorem's message is one of decomposition. That is, that the subsidiary problems may be considered in isolation, and any progress made will strictly contribute to progress in the overall problem.

which makes clear the correspondence between the FGM and lemma 1. Now by their definitions $\gamma_{\omega_i} = P(\omega_i\vert x,\Phi)$ so from the lemma we have:

The double summations above enumerate every edge along every path though the FGM. To complete the proof we have only to enumerate them in a different order:

and the parenthesized terms are the independent subproblems of definition 2. $\;\Box$

We now change our focus to computational and data structural issues. The first part of our discussion deals with the intricacies of recursion. That is, when an FGM invokes another FGM as one of its observation models. Here, proper algorithm and data structure design can lead to polynomial rather than exponential complexity as illustrated by the case of stochastic context free grammars described in section 2.6.2. We will confine ourselves to finite recursion. The second issue has to do with the arithmetic demands of FGM computation.

The idea behind reducing the complexity of recursive FGM operations is a simple one: avoid duplicate work. If a particular combination of an observation model and selection function occurs multiple times, the combination should only be evaluated once. Since choice models do not depend on the observation, each should be evaluated only once. The computation is then ordered so that when an edge is considered, its choice and observation functions have already been evaluated and their values may be looked up in constant time.

To formalize this idea recall that $M(e,x\vert\Psi)$ is defined as $M_{m(e)}(S_{\varsigma(e)}(x)\vert\Psi_{m(e)})$ . Notice that edge

appears as a function argument. We say $M(e_1,x\vert\Psi) \equiv M(e_2,x\vert\Psi)$ if their definitions are the same. Each equivalence class is referred to as a model application and expressing algorithms in terms of these classes avoids duplicate work. If for two edges $\varsigma(e_1) \neq \varsigma(e_2)$ then by this definition $M(e_1,x\vert\Psi) \not\equiv M(e_2,x\vert\Psi)$ - even if $S_{\varsigma(e_1)}$ is the same function as $S_{\varsigma(e_2)}$ . For this reason we adjust the definition to account for equivalences that may exist among the $\{S_i\}$ .

This adjustment is subtle and we therefore illustrate it by example. Suppose that the observation model associated with an edge

of the top level FGM is itself an FGM

and that selection function

is used by

. Now within

assume that some edge

uses an observation model

with selection function

. Then to evaluate

, selection functions

and

are composed. Now focus on another top level edge

to which an FGM

is attached using selection function

. Within

consider an edge $e_\ell$ using the same observation model

as does

, but in combination with selection function

. To evaluate

here

and

are composed. So

should only be evaluated once if $S_r(S_q(\cdot))$ and $S_v(S_t(\cdot))$ are the same function. This may be the case despite the fact that

and

are not. Thus, whenever possible, equivalence classes should be defined so as to account for possibly equivalent compositions of selection functions. This is easily implemented for the class of projection selection functions introduced in the next section.

Mathematically definition 1 states that each edge has an attached observation model via mapping

. Computationally the edge instead selects (in practice points to) a model application structure which identifies the observation model and the corresponding selection function. We denote the set of observation model applications by $\{{\cal M}_i\}$ . Inclusion within the recursive hierarchy induces a partial ordering on this set where it is important to realize that a given model application may be referenced at more than one recursive level. We extend this ordering by specifying that all primitive observation models are dominated by any FGM. In what follows we will then assume that $\{{\cal M}_i\}$ is indexed in topological order so that the top-level FGM is first and the primitive models form the list's end.

We also introduce choice model application structures $\{{\cal C}_i\}$ . Since choice models do not depend on

or involve a selection function, these structures are in 1:1 correspondence with the choice models themselves and are identically indexed. Many references may exist to a choice model, and its corresponding application structure avoids redundant model evaluation by recording the model's current value. It also accumulates $\gamma$ contributions during expectation step processing so that a single call may be made to maximize the model. This is explained later in greater detail.

If an FGM

is a recursive child of a parent

, one alternative is to eliminate the recursion and expand the child inline within the parent by adding vertices and edges to its DAG. If

is invoked along edge

between vertices

and

then the first step of this inline expansion disconnects

from

and replaces the invocation of

with a void model. The child

is then inserted between the disconnected end and vertex

When evaluating

this inline expansion clearly isn't necessary since one can evaluate

in isolation, record its value in the model application structure, and in constant time refer to that value where needed in

. The same is true for parameter reestimation although a brief argument is needed.

proof: Let

connect vertices

and

and let $\omega$ denote some source-sink path of

and $\gamma_\omega$ its weight. If

is expanded inline the value of $\omega$ will be $\alpha_i C(e\vert\Upsilon) \gamma_\omega \beta_i / F_p(x\vert\Phi)$ . In isolation it is $\gamma_\omega / F_c(x\vert\Phi)$ . But $\gamma_e = \alpha_i C(e\vert\Upsilon) F_c(x\vert\Phi) \beta_i / F_p(x\vert\Phi)$ from which the proposition follows. $\;\Box$

So if the same model application ${\cal M}_i$ is pointed to by many FGM edges, the right computational approach is to first accumulate their $\gamma$ values and then to process ${\cal M}_i$ . Here we emphasize that this approach is more efficient and mathematically identical to inline expansion. So if at the bottom of the recursion the required $\mathrm{arg max}$ problems are solved, then at all higher levels they are too.

An observation model $M_i$

that is not an FGM is said to be primitive, and is represented by an abstract type for which the following operations are defined:

Here

is an element of the range of a selection function and $\Gamma$ is a non-negative weight propagated down as described in proposition 2. The evaluate function returns the value of $M_i$

. Notice that the model's parameters $\Psi_i$ are hidden by the abstract type. In practice a mutator may be provided to set them and a selector provided for readout. The names Estep and Mstep follow the EM literature even though FGMs are not restricted to maximum-likelihood estimation of probability functions and densities.

Given a sequence of values $z_1,\ldots,z_m$ , and nonnegative multiplicities $\Gamma_1,\ldots,\Gamma_m$ the Estep and Mstep procedures address the following optimization problem:

which generalizes Eq. 5. The Estep procedure is called for each $(z_i,\Gamma_i)$ and then Mstep is called to reestimate the model's parameters. For some models the result is the unique optimum parameter set, but in general the process is repeated. For example, if the

are letters of the alphabet, the $\Gamma_i$ are positive frequencies, and the model's value is a probability for each letter, then Estep need only record the $\Gamma$ values provided and Mstep can easily determine the unique optimal model. That is:

We point out that if the probability of symbol

is zero prior to reestimation, then $\gamma_i$ and therefore its reestimate

are zero. Such parameters in this simple discrete model are then effectively disabled since their corresponding choice of alphabet symbol will always have probability zero.

A simple unique solution also exists for the normal density and others. In the discrete example above the model is constrained to produce values which sum to one over the alphabet. Normal densities are constrained to integrate to one. These constraints are obviously important so that the overall optimization problem has a bounded solution. But it is important to observe that their precise nature is irrelevant and invisible to the FGM framework. In particular there is no need for a model to represent a probability function There is also no need to restrict one's attention to models for which unique optimum solutions are readily available. As noted earlier, any progress made for even one model, will translate to progress in the overall FGM optimization.

A choice

is represented by an abstract type for which the following operations are defined:

Here

is the index of an edge and the evaluate function returns its value under

given the model's current parameters $\Upsilon_i$ . The Mstep procedure is passed the vector $(\Gamma_1,\ldots,\Gamma_{\vert C_i\vert})$ and addresses the following optimization problem:

is a probability function then the problem is exactly that of Eq. 6 and an exact solution is available. In both cases the solution is mathematically optimal but other problem constraints might be incorporated into

that sometimes forbid it. For example, one might prohibit probabilities from falling below some fixed threshold. This amounts to a change in parameterization and we will see in a later section that this capability may be used to alter the meaning of optimal in useful ways. Note that we might instead have treated a choice model as an observation model over the natural numbers but decided that their roles are sufficiently distinct to warrant separate designs.

We now return to our discussion of model applications and begin by describing the information that must be associated with these structures. An observation application structure ${\cal M}_i$ contains:

No model variable is needed since indexing corresponds to that of the choice models themselves.

Recall that the mathematical definition 1 of an FGM associates edges directly with elements of

via functions

- there are no model application structures. While the FGM along with its associated model application structure lists might be generated incrementally, it is simpler to imagine that the FGM is first specified mathematically and then compiled into a form that includes these lists. This compiled form is denoted $\hat{F}$ and merges the

sets of each FGM into single densely enumerated sets. In compiled form FGM edges do not point directly to observation models but rather to observation model application structures. The

function is replaced by

to effect this change.

We present evaluation, expectation step, and maximization step algorithms for compiled FGMs. These are not recursive functions but rather operate by traversing the global model application lists built during compilation.

Before any of these are called the FGM must be initialized. This procedure $\mbox{\it initialize}(\hat{F})$ consists of setting the $\Gamma$ variables to zero in all choice model application structures and then evaluating each choice model and recording its value there.

Evaluating an FGM and performing an expectation step are carried out by bottom-up (reverse order) and top-down (in order) traversals respectively of the model application lists. Evaluation returns a single value for $\hat{F}$ given observation

and also records the value of all subsidiary models within the observation model application structures.

where algorithm 1 refers to the appropriate observation and choice application structure value variables rather than computing $C(e\vert\Upsilon)$ and $M(e,x\vert\Psi)$ .

As for primitive observation models the FGM expectation step procedure accepts a $\Gamma$ argument. Supplying a value of

for each

observed corresponds to the optimization problem of Eq. 5. In general these $\Gamma$ values generalize the problem by appearing as exponents to the FGM's value as in $% latex2html id marker 2355 $Eq. \ref{improvemanymult}$$ . The FGM is first evaluated so that the observation model application structures contain current values. Their $\Gamma$ values are then set to zero except for the first which corresponds to the top-level FGM and is set to the function's $\Gamma$ argument. The main loop proceeds top-down to propagate $\Gamma$ contributions to lower observation model application structures. Eventually primitive structures are reached and a primitive expectation step is performed. The FGM's value is returned as a convenience since it was necessary to compute it.

Algorithm 5 ` =13` $\textrm{\^{^\/}}$ M=13 function $\mbox{\it Estep}(\hat{F}, x, \Gamma)$ $\mbox{\it {evaluate}}(\hat{F},x)$ for each ${\cal M}_i$ if , ${\cal M}_{i}\!:\!\mbox{\it$\Gamma$} = \Gamma$ , else ${\cal M}_{i}\!:\!\mbox{\it$\Gamma$}= 0$ for each ${\cal M}_i$ in increasing index order if ``primitive'' $j = {\cal M}_{i}\!:\!\mbox{\it selector}$ $k = {\cal M}_{i}\!:\!\mbox{\it model}$ ${M}_{k}\!:\!\mbox{\it estep}(S_j(x), {\cal M}_{i}\!:\!\mbox{\it$\Gamma$})$ else ``an FGM'' Set $\{\gamma_e\}$ using algorithms 1,2,3 for each edge ${\cal M}_{m_a(e)}\!:\!\mbox{\it$\Gamma$} +\!=\gamma_e \cdot {\cal M}_{i}\!:\!\mbox{\it$\Gamma$}$ ${\cal C}_{c_1(e)}\!:\!\mbox{\it$\Gamma$}[c_2(e)] +\!=\gamma_e\cdot{\cal M}_{i}\!:\!\mbox{\it$\Gamma$}$ return ${\cal M}_{1}\!:\!\mbox{\it value}$

The FGM maximization procedure maximizes the primitive observation models and all choice models. The FGM is then reinitialized.

Observe that we do not specify that the models are maximized in any particular order since if their parameter sets are disjoint so is the act of maximizing them. Our notation thus-far has suggested that the parameter sets are in fact independent but in general they need not be. For example, two models might share the same parameters but compute two different functionals depending on them. In this case the two corresponding optimization subproblems of definition 2 are simply combined into one. From an implementation viewpoint this kind of situation can be handled entirely at the model type level through communication between them that is invisible at higher levels. So the fact that algorithm 6 specifies no order for maximization does not imply that we are limited to the case of disjoint parameter sets.

Analysis of the algorithms above is straightforward from the structure of the compiled FGM $\hat{F}$ . Let

denote the sum over nonprimitive observation model applications, of the number of edges in the FGM corresponding to each. Then both evaluation and expectation step processing require

time where

denotes the time needed to perform the required primitive observation model type operations. These are described by the final entries in the list of observation model applications. When these primitive operations are

time the overall complexity is just

since their number cannot exceed

. Each choice model is evaluated once during initialize. The maximize operation time-complexity is just that required to perform a maximization step for each observation and choice model in the system. Ignoring the internal requirements of the observation and choice models, it is obvious that $\hat{F}$ requires

space.

The algorithms above are conceptually straightforward, but an important numerical issue must be faced in their implementation. As one sweeps through the DAG, the $\alpha$ and $\beta$ variables can easily become too large or small to represent as conventional floating point values. This is true even of the final FGM value. If the FGM is a probability model, exponent-range underflow is the issue and corresponds to an extremely small probability of generating any particular observation. For complex problems this behavior is the rule not the exception since element probabilities in general decline exponentially with dimension. Logarithmic representation is one solution but the authors have also explored a floating point representation with extended exponent range as reported in [RY94]. Because $\gamma$ variables are normalized by

they may be adequately represented using standard floating point.

We now briefly discuss the computational complexity of FGM maximization. A restricted class of FGMs is identified for which maximization is shown to be NP-complete. Members of this class have constant choice functions that assume value

. There are two selection functions

and

that assume constant values

and

respectively. Notice that there is no dependency on any observation. Each observation model has a single parameter bit

and given boolean argument

assumes value $(b \wedge \bar{x}) \vee (\bar{b} \wedge x)$ . The FGM then assumes a bounded integral value and the maximization problem is well-posed. The corresponding decision problem is: does there exist a set of parameter bits such that the FGM's value is greater than a given integer

? This problem is clearly in NP since given a set of parameter bits, FGM evaluation, which runs in linear time, may be used to answer the question. We will refer to this setting as the LFGM (logical FGM) problem.

proof: The NP-complete SAT problem is easily reduced to LFGM by first associating an observation model with each variable. If the variable $v,\bar{v}$ occurs in a clause then selector

is used respectively. Clauses gives rise to trivial FGMs in which each term corresponds to a distinct edge from source to sink with the appropriate observation model and selection function attached. The FGMs for each clause are then concatenated to form the final FGM. The set of clauses are then satisfied if and only if the FGM assumes a value greater than zero. The reduction's complexity is linear whence LFGM is NP-complete. $\;\Box$

It is possible to construct reductions like that above where the observation models are continuous not discrete functions; but further discussion of the complexity of continuous optimization is beyond the scope of this chapter. These continuous formulations correspond to stochastic continuous embedding approaches to SAT and represent an interesting area for future work.

This concludes our development of FGMs as general mixture-forms and in closing we remark that our framework is certainly open to additional generalization and refinement. For example one might introduce choice models that also depend on the observation (discussed later), or replace the

edge association functions with distributions. Another approach is to dispense with choice models altogether since they may be emulated using observation models. Many such variations are possible but our goal was a development somewhere between empty generalization and over specialization.

FGM-based Probability Models

In this section we consider specializations of Finite Growth Models such that their value represents a probability. The associated DAG and attached models may then be viewed as an acyclic automaton directing the stochastic generation of objects. Stochastic modeling is the original motivation for FGMs and much of our nomenclature is drawn from this viewpoint.

The first step is to specialize the selection functions of definition 1 to projections. That is, functions that select some subset of the dimensions of tuple observation

. A projection is characterized by a subset

of the dimension indices of ${\cal X}$ . We associate some subset

with every edge

of the FGM and then denote by $\pi_{g_e}(x)$ , the projection of observation tuple

corresponding to selection of the components in

. This then is the selection function associated with

. For some edge

suppose $g_e = \{2,4\}$ . Then $\pi_{g_e}(x)$ is a tuple with two components which are equal in value to the second and fourth components of

. We assume for now that $g_e \neq \emptyset$ but will later see that this is unnecessary. A composition of projections, applied to

, is again a projection characterized by some dimensional subset; whence it is straightforward to define equivalence for the purpose of identifying truly distinct model applications for recursively combined FGMs. The use of projections corresponds to the generation of observations in arbitrary groupings and orderings - a fact we will rely on later in our discussion of stochastic transduction.

Next we assume that all choice and observation models are probability functions or densities. Each

then corresponds to a stochastic choice among edges to follow. Each $M_i$

corresponds to stochastic generation of some portion of

(since our selectors are projections).

Finally we specialize the

subsets so that along any source-sink path their intersection is void and their union is the complete set of dimensions. Each path then represents one way in which

might be stochastically generated or grown and corresponds to a permutation of the dimensional indices $1,\ldots,d$ . The ``finite growth model'' framework is so named because we imagine that complete observations are grown, starting from the empty set, in many possible orderings, as specified by the model's underlying graph. The value of each source-sink path through the FGM is the probability of following that path and generating

. The FGM's value is then a probability function $P(x\vert\Phi)$ on ${\cal X}$ , i.e. its sum/integral is one. When it is necessary to make clear that we are referring to an FGM with the restrictions above, we will use the term stochastic finite growth model (SFGM).

After algorithm 1 has run, $\alpha_n$ gives $P(x\vert\Phi)$ . However the other $\alpha$ values have meaning as well. In general $\alpha_i$ is the probability of arriving at

and observing that part of

corresponding to the paths between

and the source. Note that because each source-sink path corresponds to a permutation of the observation tuple, every path from the source to

must generate the same observation components -- possibly however in different orders. Dually, $\alpha_i$ is the probability that random operation of the machine encounters

, while at the same time generating those observation components accounted for by the source to

paths. Algorithm 1 sums over all paths through the FGM but is easily modified to find instead a single optimal path. This is referred to as the Viterbii decode of the FGM and answers the question: what is the single most likely explanation for the observation?

Algorithm 2 also computes $P(x\vert\Phi)$ as $\beta_1$ . In general $\beta_i$ is the probability of observing that part of

corresponding to the paths between

and the sink, given passage through

. Dually, $\beta_i$ is the probability that the machine passes through

and proceeds on to generate those observation components accounted for by the

to sink paths. Denoting by

the part of

corresponding to paths from the source to

, and by

the part of

corresponding to paths from

to the sink, $\beta_i = P(R\vert v_i)$ and $\alpha_i = P(L,v_i)$ . But $P(R\vert v_i) = P(R\vert L,v_i)$ because an SFGM is a Markov process. So $\alpha_i \cdot \beta_i = P(L,v_i,R)$ , i.e. the probability of generating

and passing through

Observe that the meaning of $\beta_i$ is not simply a time-reversed interpretation of $\alpha_i$ . This is because an SFGM has an intrinsic directionality. In-bound DAG edge probabilities need not sum to one so simply reversing edge direction does not leave a well-formed SFGM.

For SFGMs the $\gamma_e$ values computed by algorithm 3 are the probabilities $P(e\vert x) = P(e,x) / P(x)$ ; where `

' refers to the event of a transition over edge

. That is, $\gamma_e$ gives the a posteriori probability of following edge

conditioned on observation (generation) of

. Given an SFGM and any

, the $\gamma_e$ values must satisfy the following constraint which can in practice be used as an implementation self-check. It is a specialization of proposition 1 to SFGMs:

proof: The key idea is that the edges that observe a given observation dimension induce a cut in the DAG but we present the proof from a probabilistic viewpoint. Assume $\vert g_e\vert=1, \forall e$ . Because every source sink path path generates each observation component exactly once, the set of edges

which generate a particular component

, corresponds to a set of mutually exclusive and exhaustive events. Hence $\sum_{e \in E_j} P(e\vert x) = \sum_{e \in E_j} \gamma_e = 1$ . There are

observation components, and these partition

into disjoint sets $\{E_i\}$ . So $\sum_{e \in E} \gamma_e = d$ , establishing the first part of the proposition. In general, the $\{E_i\}$ are not disjoint, and the number of sets to which an edge

belongs is $\vert g_e\vert$ . Multiplying each $\gamma_e$ by $\vert g_e\vert$ in effect gives each set its own copy, so that the sum remains

. $\;\Box$

We required above that $g_e \neq \emptyset$ . If necessary this restriction may be effectively removed and one may include so called nonemitting edges in a model. To do so the observation tuple is augmented with dummy components each capable of assuming only a single value. Nonemitting edges then observe these components assigning them probability one. The result is a probability model on the augmented observation space, and since its extension was trivial, on the original space.

The

and

functions of our FGM framework may be used to implement the parameter tying concept from the stochastic modeling literature. When

the observation models attached to these edges are said to be tied. If the edges leaving vertex

map under

to the same choice model as those leaving another vertex $v_2$

, the choice models at

and

are said to be tied. In crafting a model there are two reasons for parameter tying. The first is that it reduces the number of free parameters in the model and as such combats overtraining. The second is that problem domain knowledge may suggest that two models represent the same underlying phenomenon.

Equation 5 formalizes the problem of parameter optimization given a training set $x_1,\ldots,x_m$ by forming a cascade of identical FGMs. More generally we may write:

where now a possibly different FGM is associated with each training observation but all are tied to a single underlying collection of choice and observation models. This is an important conceptual point since it represents the technique used to deal with training examples of varying dimension. For example, each

might represent a finite time series. Typically the FGMs are instances of a single design - varied in order to accommodate the structure of the observation tuple. Equivalently the entire cascade may be regarded as a single FGM operating on a single observation formed by concatenation of the entire training set. This cascading and concatenation is of conceptual value only. In practice one merely performs expectation steps for each pair

followed by a maximization step. There is no need to actually form the cascade and concatenation. In an SFGM setting Eq. 8 represents the training set's likelihood and our goal of maximizing it corresponds to maximum-likelihood parameter estimation of statistics.

Ultimately one is faced with the task of maximizing primitive models. For simple discrete models and many continuous ones (notably normal densities) each independent maximization problem arising from theorem 1 has a unique globally optimal solution and satisfies certain other conditions. Lemma 1 and theorem 1 may then be strengthened to say that strict progress is made unless one is already at a critical point in parameter space. Even this does not imply parametric convergence however. See [BPSW70,DLR77,RW84] for discussion of the convergence and other properties of the EM algorithm. These mathematical issues including rate of convergence, while important, are not our focus and we will not consider them further. Clearly relating various conditions on primitive models to the resulting behavior of the overall FGM represents an important area for future work.

We saw in the last section that simple discrete models are easily maximized. If $M_i$

is a multivariate normal density then maximization is also easily accomplished since the weighted sample mean and weighted sample covariance define the maximum-likelihood parameter set. The weights are $\gamma_e/\Gamma$ where $\Gamma$ denotes the sum of all $\gamma_e$ associated with $M_i$

. In general, any density for which a maximum-likelihood estimate is readily available may be used as an observation model, e.g. beta densities for random variables confined to

. See [HAJ90] for a treatment of HMMs including a compact discussion of the discrete and continuous optimization problems discussed above.

Beyond Finite Probability Models

SFGMs as defined in the previous section are probability functions on an associated finite dimensional observation space. They arise by restricting the FGM framework in several ways. We now relax certain of these restrictions and show that FGMs may be applied to a much broader class of stochastic modeling problems and settings.

Truncated Stochastic Processes

In contrast to an SFGM, stochastic models are frequently defined as either infinite generative processes, or processes that generate finite objects of unbounded size. Time series methods such as hidden Markov models are an example of the first kind and stochastic context free grammars are an example of the second. In the first case the probability of a finite object refers to the aggregated probability of all infinite objects that include it - in the case of strings the combined probability of all infinite strings with a particular finite prefix.

Processes such as these have an associated infinitely deep finite branching structure which is conceptually truncated to a finite DAG that explains observations of some fixed length. This DAG is regarded as part of an FGM whose primitive parameterized models correspond to those of the original process. Given a set of finite observations, a set of corresponding FGMs is constructed as discussed in earlier sections. In this way the parameters of an infinite process that best explain a set of finite observations may be learned using FGMs.

This truncation is conveniently effected within the SFGM framework through the introduction of null observation models. These are simply functionals that assume the value zero everywhere and as such lie trivially within the FGM framework. When generation via the original infinite branching process would result in an observation component beyond the finite set of interest, a void model is attached to that edge and it is redirected to the FGM's sink. In this way the processes infinite graph becomes a finite DAG with a single source and sink. This is illustrated by our discussion of stochastic transducers and hidden Markov models in later sections.

A more subtle issue is addressed by the introduction of void choice models which assume value

for each outgoing edge. They are useful when the events associated with future generation may be partitioned into mutually exclusive sets. Our discussion of stochastic context free grammars in a later section provides illustration.

Alternative Maximization Criteria

Maximum likelihood (ML) parameter estimation is a best-fit strategy and is therefore susceptible to overtraining. For example, if a discrete model never observes a given symbol, its ML parameters assign the symbol probability zero. The result can be poor generalization. Also, our discussion thus far has tacitly assumed that a single stochastic model represented as an FGM is used to explain a set of observations. Here too if this model is very complex in comparison to the available training data, generalization will suffer. The model selection problem consists of somehow choosing or forming a model of appropriate complexity.

These two issues are highly related and we will now briefly describe how both may addressed within the FGM framework. The result is a conceptually clear and computationally tractable approach to implementing more sophisticated estimation schemes for both existing and new stochastic models. That EM could accomplish this in general was observed in [DLR77] but these penalized likelihood functions do not appear to have significantly impacted the use of EM in practice. We consider two related approaches: maximum a posteriori probability (MAP) estimation, and minimum description length (MDL). Our discussion begins with MAP.

Earlier sections have shown that Baum-Welch/EM may be used to address the problem of finding $\Phi$ that maximizes $F(x\vert\Phi)$ . The form of this objective corresponds to maximum-likelihood estimation but we will see that it actually not so limited. Given a prior $P(\Phi)$ on model parameters the MAP objective is to instead maximize $P(\Phi\vert x)$ . From the Bayes' rule it follows that this is the same as maximizing $P(x\vert\Phi) \cdot P(\Phi)$ since this quantity is proportional to $P(\Phi\vert x)$ , i.e. the denominator is fixed. We resolve $\Phi$ into its constituent parts $\{\Psi_i\}$ and $\{\Upsilon_i\}$ corresponding to the FGM's observation and choice models respectively, and define the overall prior based on independent parts:

This product may itself be regarded as a linear FGM denoted

with null observation models, and 1-way choice models having values corresponding to each product term. Alternatively each term may be represented as an observation model that assumes a constant value for each setting of its parameters. The two FGMs

and

are then combined end-to-end, or in the case of multiple observations

is attached to the end of $F_1,\ldots,F_M$ . The FGM

computes $P(x\vert\Phi)$ and

computes $P(\Phi)$ so that the cascade thus formed has value $P(x\vert\Phi) \cdot P(\Phi)$ . Maximizing it will then effect MAP-guided learning.

The minimum description length principle states that one should choose a model and its parameters such that the combined cost of describing these followed by the encoded observations is minimized. The related notion of Kolmogorov complexity deals with the smallest such description and is generally not computationally tractable. By contrast MDL practitioners are content to fix some parameterized scheme of representation and optimize the parameters. This does not fully implement the MDL principle but captures the important tension that must exist between model and data complexity. The cost of describing the observations is $-\log_2 F(x\vert\Phi)$ bits where it is assumed that

is a probability function. The cost of describing the model's structure is upper bounded by choosing some reasonable representation and counting its bits. The cost of describing a real valued parameter may be upper bounded by choosing some number of quantization levels

, assuming uniform quantization (typically), and describing $-\log2 Q$ bits per parameter. Many variations are possible. The final codelength is denoted $L(\Phi)$ and the total description length is $L(\Phi) + -\log_2 F(x\vert\Phi)$ . As in the case of MAP above we assume that $L(\Phi)$ is formed by summing separate contributions from primitive models. As before these may be regarded as a linear FGM where the value of the attached models is just the corresponding codelength exponentiated. Notice that the the MDL and MAP constructions are essentially identical and for this reason we view MDL as a Bayesian approach in which the prior is not necessarily a probability model.

The important message is that the original maximization problem may be decomposed into separate primitive problems even when the guiding criterion is not maximum-likelihood. Any primitive model for which maximization may be conveniently performed subject to the selected criterion, may be used.

Since the basic FGM is fixed in the discussion above, the cost of describing its representation in the MDL framework is also fixed with respect to FGM maximization. However an important part of the MDL outlook is selecting a model of appropriate complexity. This tension is captured in the FGM framework by forming a finite mixture of models over a range of complexities. Each model is made up of a basic FGM augmented with an MDL extension as described above. The cost of describing the model's design now enters the optimization problem since it differs across mixture components. Rather than selecting a single model complexity, reestimation will learn a distribution on complexity. Each mixture component is the a posteriori probability of its corresponding complexity given the observations. Similarly the MAP setting above may be extended to include the learning of a distribution on model complexity. The model with maximum a posteriori probability may be selected if it is important to identify a single best model. Finally, in the case of MDL the pedantic implementor will account for the complexity of the initial mixture in the overall optimization.

The result is not MDL at all in its original sense since a single model and parameter set is not selected. It is more properly regarded as MAP with MDL-inspired priors.

Our discussion above started by placing all penalty terms together at the front of an FGM. In the model selection example, however, some were placed instead following an initial choice. This may be seen as a special case of a general strategy in which penalty terms are placed as deep as possible within the DAG. That is, just before their first use.

The practical significance of this theorem is Baum-Welch/EM may be used to efficiently implement estimation criteria more sophisticated than ML with which it is historically associated - and via FGMs it is clear how this can be done for a wide variety of models.

Alternative Discrete Models

The trivial discrete probability model on an alphabet of

members has

nonnegative free parameters that sum to at most unity. Many other possibilities exist. For example, if alphabet symbols correspond to states in a truncated Poisson process then the Poisson probability function $P(x\vert\lambda)$ may be used. This is true of a choice model as well and has application in the modeling of event durations. In general we remark that the use of nontrivially parameterized choice functions represents an interesting direction that has received only limited attention.

Observation Context Conditioned Probability Models

In our SFGM framework each edge generates some subset of the observation tuple's components such that as one moves from source to sink each is generated once. The values generated depend only on the edge over which generation is transiting. This corresponds to the first-order Markov assumption of stochastic modeling. In particular the value cannot depend upon route followed to reach the edge. But dependence on observation components generated earlier along the path does not violate the Markov assumption. Formally the observation models are then conditional probability functions where the conditioning is on earlier observation components - not earlier edge sequences. Each source-sink path then corresponds to a causal chain of conditional probabilities so that the FGM's value is a probability.

Brown observes in his thesis [Bro87] that the HMM output independence assumption may be relaxed to allow generation of the current speech sample window to depend on the previous one. He then uses Baum-Welch reestimation without reproof. His focus is on minimizing the number of model parameters - and we suspect that it is only for this reason that he did not propose longer context dependencies.

Our contribution is then the formalization of this message in general form within the FGM framework. That is: the problem of optimizing FGMs that employ conditional observation models is reduced to corresponding primitive conditional maximization problems.

Noncausal Unnormalized Models

Given a $k \times k$ pixel real-valued image one may model each pixel position as a function of some surrounding context. If each model's value is a probability and the context-induced dependency graph has no cycles, then the individual pixel probabilities may be multiplied to result in a causal probability model. Causal models are commonly built by fixing a scanning sequence and choosing contexts with respect to the implied temporal ordering. Since there is frequently no single natural scanning sequence the causality restriction may limit model effectiveness by preventing all of a pixel's neighbors from contributing to its prediction.

If the causality restriction is discarded each pixel's model is strengthened but their product is no longer a probability. This corresponds to relaxing the requirement that the projection functions in an SFGM correspond to a permutation of the observation dimensions. Noncausal neighborhood systems have proven useful in image processing [CJ93].

Since reestimation must nevertheless climb we have the result that even a noncausal unnormalized models like that sketched above may be improved within the FGM framework.

Dynamic Choice Function

As remarked earlier we might have allowed choice functions to depend on the observation

. Given choices $c_1,\ldots,c_k$ the FGM may then follow each with probability $p(c_i\vert x)$ . In contrast with the conventional static notion of state transition probability, such choice models are dynamic - responding to the observation. This provides interesting new modeling flexibility. If the portion of

used to influence the choice is generated earlier in the DAG (closer to the sink), then one can build probability models using dynamic choices. We remark, however, that useful models may exist that violate this assumption and therefore generate values that without normalization do not represent probabilities.

Stochastic Transducers

The notion of transduction has its roots in classical formal language theory (e.g. [Ber79]) and represents the formalization of the idea of a machine that converts an input sequence into an output sequence. Later work within the syntactic pattern recognition outlook addressed the induction problem for a particular subclass of finite transducer [OGV93]. More recently a stochastic approach was taken in [BF96] where hidden Markov models were used to capture an input-output relationship between synchronous time series.

Finite growth models can be used to derive Baum-Welch/EM based learning algorithms for stochastic transducers. This interesting class of stochastic models capture in a somewhat general way the relationship between dependent symbol streams. In particular there is no assumption that the streams are synchronous, and the symbols singly or jointly generated during transduction can depend upon all context information, i.e. that in every series involved.

Many problems of pattern recognition and artificial intelligence such as speech or handwriting recognition may be viewed as transductions from an input signal to an output stream of discrete symbols - perhaps via one or more intermediate languages. In this conceptual framework it is interesting to note that speech synthesis is merely transduction in the reverse direction. In this section we first discuss transduction in general terms, and then focus in detail on the simple memoryless transducer that corresponds to the widely-used notion of string edit distance.

Definition 3 A

-way stochastic transducer over alphabets $\Sigma_1,\ldots,\Sigma_k$ is a stochastic automaton

where

denotes its states and

its matrix of transition probabilities - such that associated with each transition

is a probability function

on $\Sigma_1^* \times \ldots \times \Sigma_k^*$ such that $p_e(\epsilon,\ldots,\epsilon) = 0$ where $\epsilon$ denotes the empty string. In the generative view of transduction the machine outputs nonempty

-tuples which are appended to

initially empty output strings. If any of the

values depend not just on

but also on the output generated thus-far, then the transducer is said to have memory. Otherwise it is memoryless. Automaton

is assumed to start from state zero, and the possibility that the alphabets are continuous is left open.

Given observed strings $S=(s_1,\ldots,s_k)$ the transducer

assigns a probability

which represents the sum over generation sequences that generate

, of the probability of each. The FGM formalism provides a natural way to understand and reason about this complicated process and we now describe the reduction of a stochastic transducer and associated finite observation to an FGM.

Denoting the states of

by $\{S_i\}$ the transducer's infinitely deep temporal branching structure is truncated to form an FGM with states $\{S_i^{\ell_1,\ldots,\ell_k}\}$ where each $\ell_j$ is a nonnegative integer and $S_0^{0,\ldots,0}$ is the source. The superscript indices correspond to the number of symbols that have been written to the transducer's

output tapes.

The FGM contains another set of states $\{T_{i,j}^{\ell_1,\ldots,\ell_k}\}$ defined similarly except that the subscript refers to a pair of states in the original automaton. Edges exist between these two state sets but not between members of each. There are edges leaving each state $S_i^{\ell_1,\ldots,\ell_k}$ leading to $T_{i,j}^{\ell_1,\ldots,\ell_k}, j=1,\ldots,\vert S\vert$ . Row

of stochastic transition matrix

provides the corresponding choice model where the matrix entries are the parameters. The observation model attached to each such edge assumes the constant value

Directed edges exist between

states and

states reflecting the writing of

-tuples to the output tape. For each transition

from state

in the original automaton, several FGM edges may be required because the function

may generate output of differing lengths. More formally for all $i_1,\ldots,i_k \in {\mathbb{N}}^k$ an edge exists leaving $T_{i,j}^{\ell_1,\ldots,\ell_k}$ leading to $S_j^{\ell_1+i_1,\ldots,\ell_k+i_k}$ . Each such edge generates observations according to

, but only string tuples with lengths $i_1,\ldots,i_k \in {\mathbb{N}}^k$ are allowed. Void choice models are used for all edges from

Finally positions in the string-tuple observation are enumerated and appropriate projectors serve as the FGMs selection functions and associate observation models with the generation particular positions in the output tape. Notice that along a source-sink path the corresponding projections are disjoint and do cover all positions but do not in general do so in a simple left-right fashion. This completes our reduction of

to an FGM.

We now evaluate the complexity of FGM operations by counting edges. In general each function

may generate output tuples of unbounded total length. But in some cases such as that of string edit distance discussed shortly, the length of the string generated in each tuple position is bounded by some value

denotes the length of the longest string in the observation tuple, the FGM has $L^k \vert S\vert$ states $\{S_i^{\ell_1,\ldots,\ell_k}\}$ and there are $\vert S\vert$ edges leaving each of them leading to

states. The

states number $L^k \vert S\vert^2$ . The out-degree of a

state is clearly

making the total number of edges $O(L^k D^k \vert S\vert^2)$ and this gives the time and space complexity of FGM operations assuming constant time for primitive model operations. The out-degree of a

state is at most

in the FGM so an alternative expression $O(L^{2k} \vert S\vert^2)$ applies even when

has infinite support.

Having reduced

to an FGM we identify several important operations and briefly describe their corresponding FGM computation:

A cascade operation may also be defined where the output of one transducer becomes the input to another. The development above has then established

Note that the space complexity of evaluation and decode is actually smaller since only a forward pass is performed and there is no need to maintain a fully formed DAG in memory. Also, our earlier comments regarding alternative maximization criteria apply to transducers as well and other generalizations are possible including multiple start states and parameter tying.

String Edit Distance

The edit distance between two finite strings

over finite alphabet $\Sigma$ is the least costly way to transform

into

via single-symbol insertions, deletions, and substitutions. The non-negative costs of these primitive edit operations are parameters to the algorithm. These costs are represented as a table $c_{i,j}$ of size $\vert\Sigma\vert+1 \times \vert\Sigma\vert+1$ , where row and column

correspond to an additional alphabet member $\epsilon$ representing the null character. See [SK83] for review or [HD80] for a compact discussion. The entry in table position

gives the cost of substituting symbol

for symbol

. The first row gives insertion costs, the first column gives deletion costs, and the remaining entries specify substitution costs. The table need not be symmetric. Recall that a simple dynamic program finds the edit distance in $O(\vert s\vert\cdot\vert t\vert)$ time.

The task of learning an optimal cost table is clearly under-constrained, because the string edit distance for all strings may be trivially minimized by setting all edit costs to zero. Our outlook is stochastic and we therefore interpret each cost as $-\log_2()$ of the corresponding event probability. For example, an entry in the top row expresses the probability of choosing to insert a particular symbol into the left string. The natural constraint then requires that the probabilities that underly the cost table must sum to one. Expressed in terms of costs $c_{i,j}$ , this translates to:

Rather than imagining an editing process that transforms

into

, we equivalently imagine that the pair

is grown in a series of steps of three kinds: joint steps, left steps, and right steps. This outlook was first introduced in section 3.2.2 of [HD80]. Our contribution is its further development to include the learning of costs and the construction of more sophisticated distance functions as described later in this section. In [RY96b] the authors implement the learning of costs and give experimental results. This report also

The process is formally a 2-way stochastic transducer with a single state, no memory, and

. The initial state of the process is $(\emptyset, \emptyset)$ . A joint step concatenates symbols to both elements of the pair. A right step adds a symbol to only the right element of the pair, while a left step adds a symbol to the left element. In the language of string edit distance, a joint step corresponds to a substitution operation. When a joint step adds the same symbol to both strings, one is substituting a symbol for itself. A right step corresponds to an insertion operation, and a left step corresponds to a deletion operation.

Figure 1 depicts the table now denoted $\chi_{i,j}$ after conversion to probabilities. The zero entries correspond to infinite costs. The stochastic choice between left, right, and joint generation is implicit in this table. That is, $\chi$ combines choice and observation probabilities. For example, the sum of the top row may be interpreted as the probability of choosing to perform a left insertion.

Figure 2 depicts the FGM corresponding to stochastic edit distance between strings of length

and

. Its 3-way branching pattern follows from the structure of the corresponding growth process and matches exactly the dependency structure of the dynamic program for conventional edit distance.

Notice that all but the sink and the bottom and rightmost vertices have ternary out-degree, and that the edges leaving them are drawn with solid lines. These are all in tied. Both their choice and observation models are specified by $\chi$ corresponding to normal operation of the generative stochastic process. Edges drawn with dotted lines implement truncation of the process and employ null models. These might have been depicted instead as individual long edges directly to the sink.

The $-\log_2()$ operation converts the probability to a non-negative ``distance'', which we will later see corresponds closely to the standard notion of edit distance. Also notice that in the definition, we refer to the probability of the ordered pair

. This is an important distinction since $\chi$ need not be symmetric, and it might be that $P((s,t)\vert\chi) \neq P((t,s)\vert\chi)$ .

It is also possible to interpret $P((s,t)\vert\chi)$ with respect to finite event spaces. If we denote by $S_{\vert s\vert,\vert t\vert}$ the FGM state corresponding to the argument string lengths, then the probability defined above may be written: $P((s,t),S_{\vert s\vert,\vert t\vert}\vert\chi)$ . That is, the probability of generating

and

while passing through FGM vertex with coordinates $(\vert s\vert,\vert t\vert)$ . It is straightforward to compute the probability of event $S_{\vert s\vert,\vert t\vert}$ . Division then yields the conditional probability $P((s,t)\vert\chi,S_{\vert s\vert,\vert t\vert})$ which amounts to conditioning on string lengths. It is also possible to define the FGM differently in order to directly implement conditioning on string lengths [RY96b].

**Figure:** Augmented table of edit distance probabilities. This table sums to one, and the corresponding edit costs are $-\log_2()$ of its values.
$\scalebox{0.75}{\includegraphics{sedtab.fig.ps}}$

The stochastic edit distance $-\log_2 P((s,t)\vert\chi)$ is highly related but not identical to conventional edit distance. If stochastic table $\chi$ is converted by $-\log_2()$ to a table of non-negative costs then conventional edit distance corresponds to the log-probability of the single best path through the FGM - i.e. the Viterbi decode. Stochastic edit distance is the sum over all paths. Basing decisions on the Viterbi decode is analogous to focusing on the mode of a distribution while stochastic edit distance corresponds to the mean. In classification systems we expect that both forms will lead to nearly identical decisions. The improved costs we learn in the next section will provably reduce stochastic edit distances, but we also expect that in practice they will also improve results for the conventional form.

We formalize the application of FGMs to stochastic edit distance with the following:

Corollary 1 Let $\chi$ be a table of stochastic edit distance probabilities, and

, $\ldots$ , $(s_{\scriptscriptstyle T}, t_{\scriptscriptstyle T})$ denote a set of string pairs over finite alphabet $\Sigma$ , where $\ell$ bounds the length of any single string. Next let $E_\chi(s,t)$ denote the stochastic edit distance $-\log_2 P((s,t)\vert\chi)$ . Then:

There exists an $O(\vert s\vert\cdot\vert t\vert)$ time, $O(\min(\vert s\vert,\vert t\vert))$ space algorithm to compute $E_\chi(s,t)$ - matching the complexity of the conventional dynamic program.
There exists an $O(\ell^2 T)$ time and space algorithm for producing a revised cost table $\chi^\prime$ using Baum-Welch/EM such that:

$\begin{displaymath} \sum_{i=1}^T E_{\chi^{\prime}}(s_i,t_i) \leq \sum_{i=1}^T E_\chi(s_i,t_i) \end{displaymath}$

where the inequality is strict unless $\chi$ is a critical point.

The FGM formalism is useful to arrive at a correct solution but because of the extensive parameter tying employed in our construction, the associated computations may be greatly simplified. There is no need to build actual DAGs during the computation. Also, as remarked earlier, the choice model is implicit in $\chi$ so that creation of separate choice and observation models is unnecessary when computing $E_\chi(s,t)$ . The probability of passing over an edge is just the corresponding entry in $\chi$ .

The result is algorithm 7. It requires an $(\vert s\vert+1) \times 2$ array of $\alpha$ values. We write $\alpha_{i,j}$ to denote an element, where $1 \leq i \leq \vert s\vert+1$ and

. This array corresponds to the vertices within two adjacent columns of the FGM as depicted in figure 2. The logarithm $E_\chi$ of the joint probability of

and

is returned rather than the probability itself, so that the caller need not deal with extended range floating point values. As observed by [HD80], the algorithm's structure resembles that of the standard dynamic program for edit distance.

As in the case of evaluation, there is no need to create separate choice and observation models when implementing reestimation. A brief argument establishes this fact. Assume we do have separate models. Then each edge

is of type left, right, or joint, and during Baum-Welch/EM the quantity $\gamma_e$ will be added to a choice accumulator and also to an accumulator corresponding to the observed alphabet symbol or symbols. So the total $\gamma$ contribution to each of the three choice accumulators will exactly match the contribution to its corresponding observation model. Hence, if we instead add the $\gamma$ values directly into an initially zero array ${\overline{\chi}}$ , the total contributions to the left column, top row, and interior will exactly correspond to the choice accumulator contributions above. So after normalization, ${\overline{\chi}}$ will represent the same model as that resulting from reestimation based on separate models.

Algorithm 8 is the central routine in the stochastic edit distance learning process. It is called repeatedly with pairs of strings. The current cost table $\chi_{i,j}$ must be initialized before the first call and the improved cost table ${\overline{\chi}}$ must be set to zeros. As each pair is processed, non-negative values are added to locations within ${\overline{\chi}}$ . After all the pairs have been presented the ${\overline{\chi}}$ array is simply normalized so that the sum of its entries is one, and then represents the set of improved costs. This entire is repeated to further improve the model. That is, ${\overline{\chi}}$ is copied to $\chi$ and ${\overline{\chi}}$ is again set to zeros. The sequence of training pairs is then presented again, and so on.

The algorithm uses two work arrays, $\alpha_{i,j}$ and $\beta_{i,j}$ , corresponding to the algorithms 1 and 2 respectively. It is assumed that these arrays are large enough to accommodate the training pairs. They need not be initialized by the caller. As in algorithm 7 it is possible to reduce the size either $\alpha$ or $\beta$ but not both. Doing so results in a constant 2:1 space savings but for clarity we present the algorithm using complete arrays.

Algorithm 8 ` =13` $\textrm{\^{^\/}}$ M=13 procedure $\mbox{\it LearnCosts(s, t)}$ for $i = 1,\ldots,\vert s\vert+1$ for $j = 1, \ldots, \vert t\vert+1$ if $i > 1 \vee j > 1$ $\alpha_{i,j} = 0$ else $\alpha_{1,1} = 1$ if $\alpha_{i,j} +\!=\alpha_{(i-1),j} \cdot \chi_{s_{(i-1)},0}$ if $\alpha_{i,j} +\!=\alpha_{i,(j-1)} \cdot \chi_{0,t_{(j-1)}}$ if $i > 1 \wedge j > 1$ $\alpha_{i,j} +\!=\alpha_{(i-1),(j-1)} \ \cdot \chi_{s_{(i-1)},t_{(j-1)}}$ for $i = \vert s\vert+1,\ldots,1$ for $j = \vert t\vert+1, \ldots, 1$ if $i \leq \vert s\vert \vee j \leq \vert t\vert$ $\beta_{i,j} = 0$ else $\beta_{(\vert s\vert+1),(\vert t\vert+1\vert)} = 1$ if $i \leq \vert s\vert$ $\beta_{i,j} +\!=\beta_{(i+1),j} \cdot \chi_{s_{i},0}$ ${\overline{\chi}}_{s_{i},0} +\!=(\alpha_{i,j}\cdot \chi_{s_{i},0}\ \cdot \beta_{(i+1),j}) / \alpha_{(\vert s\vert+1),(\vert t\vert+1)}$ if $j \leq \vert t\vert$ $\beta_{i,j} +\!=\beta_{i,(j+1)} \cdot \chi_{0,t_{j}}$ ${\overline{\chi}}_{0,t_{j}} +\!=(\alpha_{i,j}\cdot\chi_{0,t_{j}} \ \cdot \beta_{i,(j+1)}) / \alpha_{(\vert s\vert+1),(\vert t\vert+1)}$ if $i \leq \vert s\vert \wedge j \leq \vert t\vert$ $\beta_{i,j} +\!=\beta_{(i+1),(j+1)} \ \cdot \chi_{s_{i},t_{j}}$ ${\overline{\chi}}_{s_{i},t_{j}} +\!=(\alpha_{i,j} \ \cdot \chi_{s_{i},t_{j}}\cdot\beta_{(i+1),(j+1)}) \ / \alpha_{(\vert s\vert+1),(\vert t\vert+1)}$

In algorithm 8 we remark that one should escape after the $\alpha$ computation in the event of a zero probability observation. Also, the self-check of proposition 3 has a particularly simple interpretation here: if one accumulates all contributions to ${\overline{\chi}}$ , doubling those corresponding to substitutions, the result is exactly $\vert s\vert+\vert t\vert$ .

We now illustrate by example that one has no guarantee of convergence to a global optimum. Let

and

provide the sole training pair. If the probability of substitutions

and

are

, and that of left-inserting

is also

, and left-inserting

has probability

, then learning converges to a local optimum in which these operations have probability

, and

respectively. The algorithm has in effect chosen to edit

into

by substituting

, then

and finally left-inserting

. At convergence the pair's distance is

bits. But it is clearly better to left-insert

and perform two

substitutions. Initializing the edit operations near to this objective leads learning to the desired solution and the pair's distance is reduced to about

bits. Initializing the four parameters to

yields a third local maximum giving distance $3.92$

bits.

String edit distance is a very simple example of stochastic transduction. Using it to ground our discussion we now return to more general observations.

The costs employed in the edit distance computation depend only on the targets of the edit operations. They are independent of the context in which the edit operation occurs. Yet in most naturally occurring edit processes the likelihood of an edit operation depends strongly on the context in which the operation occurs. Within the FGM framework one may strengthen the model for edit distance so that the cost of an edit operation depends on the context in which that operation is performed. Insertion and deletion costs can depend on one or more positions of trailing context in a single string while substitution costs can depend on a joint context constructed from both strings. This modeling technique should lead to more powerful similarity functions and is applicable to transducers in general.

**Figure 2:** Finite growth model DAG corresponding to computation of the edit distance between strings of length 4 and 5.
$\scalebox{0.75}{\includegraphics{figs/sedfgm.fig.ps}}$

Context models may also be built by encoding context into the state space. To do this one limits observation models to trivial probability functions that assign probability

to a single alphabet member. After passing over an edge one then has certain knowledge of the most recent symbol. Beyond this the graph can be designed so that arrival in a state implies exact knowledge of a trailing finite context of any desired length. In effect the modeling work is moved entirely to the choice models. Having arrived at a vertex the choice model predicts the next symbol conditioned on the past. Note that this technique is an example of something that does not generalize to continuous observations.

To specify simple stochastic edit distance, we must learn only a single $\chi$ table. With contexts, many more free parameters are added to the model. In principle very large context models can be built, but in practice their size is limited by the amount of training data available increasing the need for MDL or MAP guided optimization.

The learning of a

-way transducer's parameters represents one formalization of what the authors have called the metric learning paradigm. The training set may be regarded as positive examples of the are similar concept and after learning, the negated logarithm of the FGM's joint evaluation or conditional evaluation is viewed as a measure of dissimilarity. The mathematical metric axioms are not in general satisfied so by ``metric learning'' we really mean any learned quantification of similarity or dissimilarity.

This idea applies beyond string settings and for processes more complex than our definition of transduction which specifies that a tuple observation is grown from left to right by concatenation. The string edit distance DAG need not be viewed as arising from

-way transduction. The salient characteristic that implements metric learning is that some edges observe both strings while others see only one. It is not essential that as one moves from source to sink the string positions considered within each tuple position also move from left to right.

It is then clear that distances analogous to string edit distance may be defined for objects whose natural growth description is more complex than the linear DAG that corresponds to a string. We do not fully develop this direction but now make precise the corresponding DAG structure.

In the case of edit distance each FGM vertex faces three choices. Two choices correspond to single symbol generation, and a third choice selects joint generation. One may view the resulting graph as a product of two linear FGMs, each of which grows a single string from left to right. We now generalize this notion. Given two FGMs

we define the generation DAG of the product FGM denoted $F_A \otimes F_B$ as follows:

It is readily verified that $F_A \otimes F_B$ has a single source and sink, and that each path corresponds to a permutation of the joint observation tuple so that after specifying the necessary choice and observation models, a well-formed SFGM results.

New Perspectives on Existing Models

In this section we relate FGMs to three models in widespread use: hidden Markov models, stochastic context free grammars, and normal mixtures. The problem of optimizing the $\log$ return of a fixed portfolio is also shown to correspond to a trivial FGM. We focus on reduction of these problems in standard form to FGMs but remark that any of them may be improved by using MDL or MAP guided optimization as described earlier.

Hidden Markov Models

A hidden Markov model (HMM) is an automaton that changes state stochastically after generating an observation. See [Por88,HAJ90] for reviews. Its state transitions are given by a stochastic $n \times n$ matrix

, a vector $\pi$ gives the probability of starting in each of its

states, an output function

is associated with each state

, and there are one or more designated final states. As time passes the machine changes states and generates observations which form an output time series.

An HMM is essentially a

-way stochastic transducer restricted so that all transitions from a state specify the same observation model and generate a single observation component (

). If there are more than one possible starting states then the entire corresponding FGM is preceded by a choice that plays the role of vector $\pi$ . The edges leading to the sink are weighted one or zero according to membership in the set of final states.

Because output distributions are associated with states not edges in an HMM, every FGM edge leaving a state is associated with the same distribution. For this reason the transducer reduction can be simplified by eliminating the

states whose only purpose is to remember the previous state. So if the HMM states are denoted $S_1,\ldots,S_n$ then the FGM states are just $\{S_i^t\}$ where the superscript corresponds to time.

This reduction of an HMM to an FGM amounts to the simple unfolding in time of the HMM's operation and is therefore done with no loss of efficiency. In some sense it is a simpler representation of the HMM because the temporal order of events has been made explicit.

Reduction in the other direction is problematic. Given a finite alphabet it is not hard to construct an equivalent HMM, i.e. one that imposes the same extrinsic probability distribution. But the canonical such machine may have exponentially more states because an FGM can generate observations in arbitrary order. Moreover, since an FGM can generate observations jointly, the standard HMM formulae for reestimation do not apply. Finally, an FGM that generates joint continuous observations will correspond to an HMM only if these joint densities are trivial, i.e. a product of independent terms. This suggests that FGMs may be more expressive than HMMs.

We now discuss an important variation: time duration HMMs [Lev86,RC85] and sketch their rederivation in FGM terms illustrating the utility of parameterized choice models. See [HAJ90] pages 218-221 for a brief review. If in a conventional HMM a state transitions to itself with probability

, then duration in that state is geometrically distributed, i.e. as $p^{t-1}(1-p)$ . This functional form restricts the model's ability to match the characteristics of some applications such as phoneme recognition.

In an ordinary HMM, diagonal transition matrix entries give the self-transition probabilities. These will be modeled separately so

is assumed to have a zero diagonal. The reduction of a time duration HMM to an FGM starts with the same state set $\{S_i^t\}$ but requires an additional set $\{V_i^t\}$ . Associated with a state

is a parameterized choice model that assigns a probability to each possible state duration $\Delta$ from $1-\infty$ . Edges exist from

to states $V_i^{t+\Delta}$ and along each, $\Delta$ observations are generated according to the same distribution. To express this strictly within the FGM framework a linear sequence of $\Delta$ edges is necessary with a single observation model attached to each. Associated with $V_i^{t+\Delta}$ is a simple discrete choice model induced by transition matrix

. Edges are then directed to the corresponding $S_j^{t+\Delta}$ state with no attached observation model. Notice that since the diagonal of

is zero there is no edge leading to $S_i^{t+\Delta}$ . Even though the choice model associated with each

state has infinite domain, the observation is finite so only a bounded number of edges are included in the FGM.

The simplest such infinite choice model has a parameter for every duration value - a nonparametric approach. Since the training set is finite this reduces to a standard discrete choice model which is easily maximized. Because of limited training data one might prefer a model with fewer parameters. A simple way to reduce parameters is to group them into bins and assume that within a bin probability is uniform. For example bin sizes might be $1,2,4,8,\ldots$ . Here duration

lies in the first bin, durations

in the second,

in the third, and so on. Parametric models may also be used.

We close our discussion of HMMs with the topic of nonemitting states, i.e. states that do not generate any output. These complicate our reduction to FGMs and conventional processing as well since cycles among them may have unbounded duration. Common practice is to preprocess models with such states in order to eliminate self-cycles. Denote by

the set of nonemitting states and by

the rest. Now suppose a transition occurs from an ordinary state to a member of

. The next transition may leave

immediately or cycle through its states for some time before doing so. But since no observations are generated we are merely interested in the ultimate departure event. For each state in

it is straightforward to compute the probability of ultimately exiting to one of the states in

. These values become entries in the transition matrix associated with that state. All entries associated with self-transitions among

are set to zero. This modified machine will impose the same extrinsic distribution as the original but is free of nonemitting cycles and may be reduced to an FGM.

Stochastic Context Free Grammars

A stochastic context free grammar (SCFG) is a conventional context free grammar [HU79] with a probability attached to each production such that for every nonterminal

the probabilities attached to all productions `` $A \rightarrow \ldots$ '' sum to unity. Applied to a sentence of finite length these grammars have an interesting recursive FGM structure. Baum-Welch/EM may then be used to learn the attached probabilities from a corpus of training sentences. The result is essentially the same as the inside-outside algorithm described in [Por88].

We assume the grammar

is in Chomsky normal form and use

to denote the input sentence. Viewing

as a stochastic generator of strings each nonterminal

may expand into sentences of many lengths. We write $P(A \rightarrow t[i \ldots j])$ to denote the probability that

will generate the substring of

consisting of its

th through

th positions. In particular the probability of

is $P(S \rightarrow t[1 \ldots L])$ where

denotes the length of

and

is the start symbol. For a terminal

, $P(b \rightarrow t[i \ldots j]) = 1$ if

and

, and zero otherwise. The SCFG is a probability function over all sentences, so that it is a subprobability model for those of any fixed length. For this reason null models (described earlier) are employed in the construction to truncate the normally infinite generation process.

The FGM corresponding to

restricted to length

is a recursive construction starting with the start symbol. We name the top level FGM $S^{1 \ldots L}$ because it computes the probability of

applied to $t[1 \ldots L]$ . The construction is the same for each nonterminal

with corresponding productions $A_1, \ldots, A_{N(A)}$ where

denotes the number of productions of the form `` $A \rightarrow \ldots$ ''. The FGM $A^{i \ldots j}$ has

edges from source to sink and computes $P(A \rightarrow t[i \ldots j])$ . Its single choice model corresponds to the probabilities associated with

in the grammar. Each edge generates observations $i, \ldots, j$ and the attached models are denoted $A_1^{i \ldots j},\ldots,A_{N(A)}^{i \ldots j}$ where each $A_\ell^{i \ldots j}$ is defined as follows:

The time and space complexity of the resulting model follows easily from an analysis of its structure. There are

FGMs of the form $A^{i \ldots j}$ and

of the form $[BC]^{i \ldots j}$ . The first kind contains

edges while the second has

edges. The result is

time and space complexity. The same counting of FGMs and edges demonstrates that complexity is linear in the grammar's size.

Beyond emulating conventional SCFGs, our FGM construction suggests more complex models. For example, it is possible to learn a probability for each production conditioned on the length of the sentence it generates. Here the single value associated with each production is replaced with a vector of probabilities indexed by sentence length. The vector's length may be set to the maximum length of a sentence in the training corpus. The result is a model which generates only sentences of finite length. More generally each vector position may correspond to a range of lengths, e.g. less than 5, between 6 and 20, 21 and over. If the final range is open-ended then the model generates sentences of unbounded length, as does a SCFG. This new model has more parameters, but given enough training data may outperform an SCFG. To implement it one merely ties the choice models associated with each instance of each FGM $A^{i \ldots j}$ differently. In the construction above they are all tied. Instead we tie the choice model for $A^{i \ldots j}$ to $A^{k \ldots \ell}$ provided $j-i = \ell-k$ , i.e. both instances generate output of the same length.

Normal Mixtures

where $\sum_{i=1}^k c_i = 1$ , $c_i \geq 0$ for $i=1,\ldots,k$ , and $N(\cdot)$ denotes a multivariate normal density. Such a mixture corresponds to a trivial FGM that contains exactly

DAG edges, each from source to sink, and a normal density along each. The FGM's single choice model corresponds to the mixing parameters $c_1,\ldots,c_k$ . Unlike transduction, HMMs, and SCFGs, normal mixtures are an example of an FGM which generates an observation tuple of fixed finite dimension. They may be used as the observation model along any edge in a parent FGM. Our earlier reduction of HMMs to FGMs then results in the mixture-density HMMs widely used in speech recognition. To illustrate the variations possible we observe that tying every component with the same index to a single density yields the semi-continuous variation [HJ89].

Portfolio Optimization

Our final example consists of a problem that is not strictly speaking a stochastic probability model but nevertheless fits easily into the FGM formalism. Given stocks $x_1,\ldots,x_d$ and observations of their returns over many fixed time periods, the objective is to decide how much of each should be held in an optimal investment portfolio. A trivial reduction to FGMs results in a simple iterative algorithm. This problem is considered in [Cov84] where the same algorithm, essentially a rediscovery of Baum-Welch/EM, is described. The multiplicative update perspective of [Cov84] is essentially the growth transform of [BPSW70].

We describe this problem using the notation of [Cov84] and reduce it to an FGM. A stock which appreciates by $25\%$ in say one month, is said to have a return of

. The returns of each stock over time period

forms a vector denoted

. We use stochastic vector

to denote our portfolio allocation and the portfolio's return over period

is then

. This is captured by a simple FGM consisting of a

-way choice corresponding to

with each edge leading directly to the sink. An unnormalized observation model is attached to each edge and assumes the value of the selected stock's return. Given

observations $X^1,\ldots,X^T$ , and letting $\mbox{diag}(b)$ denote the diagonal matrix formed from

, one Baum-Welch/EM iteration is given by:

after $\overline{b}$ is normalized to have unity sum. This matches the equation given in [Cov84]. Assuming the

occur with equal probability then maximizing the training set's likelihood also maximizes our expected log return from the portfolio.

Other Applications

Bayes nets [Pea88], also known as graphical models [Lau96], can be represented in their simplest forms as FGMs in which vertices correspond to variables, edge weights to conditional probabilities, and vertex a posteriori probabilities ( $\gamma$ ) to belief. More complex networks can still be represented as FGMs although the reduction is more involved. The relationship between these networks and hidden Markov models is elucidated in [SHJ96].

Exploring the relationship of FGMs to error correcting codes such as those based on trellises, and the recently developed class of turbo codes [BGT93], is a subject for future work. We wonder broadly whether our framework's generality might lead to better codes, and in particular whether DAGs that encode nonlinear time orderings might have value. Such a possibility is suggested by [WLK95,Wib96]. The relationship between Turbo codes and Bayes nets is examined in [MMC96] and understanding the relationship of FGMs to this specific outlook is another interesting area for future work.

$\displaystyle \log S$	$\textstyle =$	$\displaystyle \sum_i R_i \log S \rule{0pt}{38pt}$	(2.1)
	$\textstyle =$	$\displaystyle \sum_i R_i \log T_i/P_i$	(2.2)
	$\textstyle =$	$\displaystyle \sum_i R_i \log T_i - \sum_i R_i \log P_i$	(2.3)