We have seen that the DAG-based FGM perspective can be used to express many models. This chapter describes certain aspects of the design of a model assembly language MODL based on the FGM framework - and represents an extended abstract describing our implementation efforts in progress. In our view, hidden Markov models, stochastic context free grammars, Bayes nets, stochastic transducers, mixture models, and many others are best thought of and perhaps implemented as high level languages with which one can succinctly specify members of a particular model class. The computations ultimately performed may all be expressed as FGM operations suggesting that one should first compile these high-level specifications into FGM form, where they can be executed by a single computational engine. This approach sweeps many technical details such as extended numerical range requirements and the intricacies of recursively specified models, down to the assembly language level. Another benefit of our approach is that alternative optimization criteria such as MDL are conveniently implemented without major impact on the high level design.
The kernel of the MODL system is ANSI C-based and includes a subroutine-level programmer's API. Parameter passing is particularly simple so that interfaces to other languages are straightforward.
Because of the subroutine-level API's simplicity, an interpreted ASCII scripting language is easily wrapped around it to further simplify its use and broaden the set of potential users. Such an ASCII model specification might then be generated by any language, or perhaps by specialized graphical model design tools - and represents a highly portable model description.
Additional APIs might be developed for popular engineering/mathematics support programs such as Mathematica and Matlab4.1.
The system is object oriented, and even in the subroutine API, objects are named using character strings. Objects can be written or read from a file, and a portable ASCII representation is always used.
The components of a complex recursive FGM that includes observation models that are themselves FGMs, may be specified in an arbitrary order. At specification stage only names are recorded for each object referred to. Once the set of required objects has been specified, the top-level FGM is assembled into a working model. At this time all references are checked with respect to existence and type, and references by name are converted to internal pointers. The resulting object is then a functioning model to which data may be presented.
Additional run-time checks exist to help ensure model correctness. The error reporting philosophy is that errors should be described as thoroughly as possible, but no attempt is made to recover, i.e. the program terminates. This is accomplished using a uniform approach to error reporting that identifies several levels of context to aid in diagnosis. In the case of the scripting language, this context begins with the input line number, includes the software API subroutine involved, and may include more detailed internal information.
Several software development steps are taken to help ensure the quality and portability of the resulting system. First, it is based at the lowest level on memory allocation primitives and other modules of libpa - the ``library of practical abstractions'' [RY97b]. Because of this, run-time analyzers such as Purify4.2 are particular effective identifiers of common implementation problems. At a higher level, the subroutine library is built on top of a single object infrastructure developed for MODL . The library is compiled and tested on a wide variety of systems including Sun Solaris, Linux i586, DEC Unix4.3, and if possible on SGI Unix and Microsoft Windows NT. A library test program and test scripts are included with the planned distribution. In addition, a quick system self-check is provided at the subroutine API level to give users additional confidence that MODL is operating properly on their system. Finally, two versions of the library and scripting language interpreter are made available. The first has a great deal of ``asserts'' enabled in the MODL system and libpa libraries as well. The second disables them resulting in a considerable performance increase.
The system may be extended over time to add new primitive model types. These must be implemented in C on top of the system's object infrastructure and subject to other interface requirements.
The rest of this chapter discusses the structure of the library and its API. The scripting language remains to be designed but we anticipate that it will consist of a straightforward encapsulization of the subroutine API.
Objects are created within a specified memory set (MSET). They may be destroyed individually, or for convenience as part of the destruction of an entire MSET. The caller may create a hierarchy of MSETs starting from a single root. The destruction of any element destroys all children as well. The MSET approach is taken to simplify the generation of programs free of memory leaks. An alternative considered was the implementation of a specialized garbage collector made possible through the use of intelligent pointers.
Objects have an ASCII name within a single global name space. Some objects support a copy operation and most support file read and write operations that allow the object to be saved in a portable ASCII form.
Observation vectors are not objects within the system, but rather are simply arrays of double; each element of which has one of three attributes: ordinary, unknown, or out-of-bounds. A component of known value is of type ``ordinary''. Components labeled ``unknown'' cause FGM evaluation to attempt to marginalize with respect to them. It is also possible to infer the value these positions with the result written back into the corresponding positions. Hidden Markov and many other models correspond to a DAG with depth dependent on the length of the observed time series. Using MODL one creates a DAG deep enough to represent the longest expected series, and then uses the ``out-of-bounds'' attribute to mark positions beyond the end of the series under consideration. This is a general mechanism that applies to other variable length models as well. Discrete observations are represented using integer values. Run time range, value, and dimensionality checking is performed by the observation models and selection functions that deal with observations.
In addition to libpa, the system's utility substrate includes error reporting routines, common support for input and output of basic data types, and a data structure hashing facility used for self-checking. It also includes several as of yet unreleased libpa modules providing linear algebraic, normal density related, and dictionary functions.
There are only seven object types in the system:
This object type is best thought of as an FGM in symbolic form. That is, the references to things like observation models attached to its edges, consists of object names. The entities referred to need not yet exist when the specification is made.
The creator returns an empty specification that includes an empty DAG. A procedure is provided to add new vertices to the DAG. Vertices are identified with natural numbers. Zero designates the source. The enumeration need not be dense, but space is linear in the maximum value used. When a vertex is created a choice model is identified along with an argument (see discussion of arguments below).
Once nodes and exist, a procedure may be called to connect them using a directed edge. The corresponding choice model index must be specified along with a selection function and argument, and observation function and argument. An edge name is also assigned for use in Viterbi decoding.
The object supports topological sorting of its vertices at which time any cycles or certain other structural problems are detected. The user never invokes this sort however. It is called by the creator (assembler) of an observation model of type FGM. File I/O is supported for FGM specifications, but not for their assembled form. That is, reassembly is required each time an FGM is read from a file.
The observation model object class is in some sense the heart of the system. Presented with an observation, members evaluate its probability4.4 or support Baum-Welch/EM iteration. Other operations are described below.
The initial release of MODL supports three types of observation models: simple discrete probability functions, multivariate normal densities, and FGMs. A creator is provided for each. The FGM creator accepts as input an FGM specification, and produces an FGM observation model object. In this assembly process, an entirely new data structure is created from the specification. All symbolic references become simple pointers. An important value added by this step has to do with recursive FGMs. Such a system is organized into form that can support computation in time linear in the overall edge count. This amounts to properly ordering all required computations down to the primitive model level. Equivalent invocations of observation models are detected and only performed once. Here ``equivalent'' refers to the combination of selection and observation. What is considered is the net selection, not the sequence of steps. So the result of several layers of selection by projection can be identical to a single selection. This capability is important when implementing models such as stochastic context free grammars.
In the initial MODL release the observation argument object holds a boolean value that selects between ordinary model instances, and those required to implement MDL reestimation (see chapter 2). When the argument is true, the observation model assumes a value based on the observation. When false, its value represents a penalty term corresponding to the current value of the model's parameters. These penalty terms form in the simplest case a linear chain from the DAG's root and represent multiplication of the standard observation probability by the appropriate penalty term.
A Viterbi decode procedure may be called to locate a maximum weight (most likely) source-sink path through the DAG. A sequence of edge names is returned. An inference call is provided that replaces all unknown observation values with their most likely value given the known observation components. Evaluation attempts to marginalize with respect to unknown components. As a result one can easily compute all conditional probabilities as well. Because of the FGM framework's generality, these calls may not make sense in all settings, and for others may not return an exact answer.
During the evaluation -pass, encountering an out-of-bounds observation adds a temporary zero weight edge directly to the sink. The -pass then uses this list in its processing. In this way only the model edges relevant to the available observation data are processed.
The parameters of an observation model need not be reestimated during a maximization step, and a special mutator call provides this option.
A choice model object is associated with each vertex in an FGM and gives the probability of choosing each outgoing edge. A choice model is essentially an observation model in which the observation consists of an edge index. Only the simple choice model corresponding to a discrete observation model is provided in the initial release of MODL . In this release the choice model argument object operates as for observation models to implement MDL.
A selection function object represents a family of selection functions parameterized by a given selection function argument object. The initial release of MODL supports a single family: coordinate-wise projectors. The argument specifies which observation tuple positions are to be included, and in what order.
Selection function objects support argument composition. For projectors this combines two projections to yield an argument object representing their combination. Comparison of argument objects is also supported. In this way the detection of equivalent selection functions is accomplished in a highly general way.