Up: Publication homepage

Finite Growth Models

Eric Sven Ristad - Peter N. Yianilos

Research Report1 CS-TR-533-96
December, 1996 (revised April 1997)

Abstract:

Finite growth models (FGM) are nonnegative functionals that arise from parametrically-weighted directed acyclic graphs and a tuple observation that affects these weights. The weight of a source-sink path is the product of the weights along it. The functional's value is the sum of the weights of all such paths. The mathematical foundations of hidden Markov modeling (HMM) and expectation maximization (EM) are generalized to address the problem of functional maximization given an observation. Probability models such as HMMs and stochastic context free grammars are examples that satisfy a particular constraint: that of summing or integrating to one. The FGM framework, algorithms, and data structures describe these and other similar stochastic models while providing a unified and natural way for computer scientists to learn and reason about them and their many variations.

Restricted to probabilistic form, FGMs correspond to stochastic automata that allow observations to be processed in many orderings and groupings - not just one-by-one in sequential order. As a result the parameters of a highly general form of stochastic transducer can be learned from examples, and the particular case of string edit distance is developed.

In the FGM framework one may direct learning by criteria beyond simple maximum-likelihood. The MAP (maximum a posteriori estimate) and MDL (minimum description length) are discussed along with the application of FGMs to causal-context probability models and unnormalized noncausal models.



Keywords:Expectation Maximization (EM), Hidden Markov Model (HMM), Baum-Welch Reestimation, Stochastic: Model, Grammar, Transducer, Transduction, Metric Learning, Maximum Likelihood (ML), Maximum a Posteriori (MAP), Minimum Description Length (MDL), String Edit Distance, String Similarity.


Introduction

Hidden discrete-state stochastic approaches such as hidden Markov models (HMM) for time series analysis, stochastic context free grammars (SCFG) for natural language, and statistical mixture densities are used in many areas including speech and signal processing, pattern recognition, computational linguistics, and more recently computational biology. These are parametric models and are typically optimized using the Baum-Welch, inside-outside, and expectation maximization (EM) algorithms respectively. They share a common mathematical foundation and are shown to be instances of a single more general abstract recursive optimization paradigm which we refer to as the finite growth model framework (FGM) involving non-negative bounded functionals associated with finite directed acyclic graphs (DAG). These functionals need not be probabilities and an FGM need not correspond to a stochastic process.

With FGMs interesting new kinds of stochastic models may be designed, existing models may be improved, settings that are not strictly probabalistic in nature may be addressed, and it becomes easier to design and reason about a wide range of problems. This paper introduces and develops the FGM framework and then applies it to:

  1. the definition of $k$-way stochastic transducers whose parameters may be conveniently optimized based on training data;

  2. the special case of $2$-way string transduction which is shown to correspond closely with the notion of string edit distance where parameter optimization corresponds to learning insert, delete, and substitute costs that best explain a training corpus of similar strings;

  3. the inclusion of model complexity tensions such as minimum description length (MDL) and maximum a posteriori parameter (MAP) estimation within models such as HMMs while retaining the ability to easily optimize them;

  4. show that context dependent observation functions may be used to extend the power of conventional stochastic models while preserving the convenience of parameter reestimation;

  5. consider problems that are not strictly probabilistic in nature such as parameter estimation for unnormalized noncausal image models and investment portfolio optimization; and

  6. provide a unified framework for understanding existing models such as SCFGs and HMMs along with their many variations by giving time and space efficient reductions to FGMs.

Graphical models [20,27] represent a branch of related work in which the independence structure of a set of random variables is the main focus. One can express such models using FGMs - as well as considerably more general settings such as that in which no two variables are independent (their graphical model is fully connected), but they are jointly constrained in interesting ways.

An FGM is a directed acyclic graph with a single source and sink together with a collection of non-negative bounded weight functions $\{w_i\}$ that are associated with edges - one per edge. The value of a source-sink path is the product of the weights along it and the FGM's value is the sum over all such paths of their values. Section 0.2 presents a formal development but we will sketch the basic ideas here.

The weight function $w_i(z_e\vert\Psi_i)$associated with edge $e$accepts an argument $z_e$and parameters $\Psi_i$. The objective of FGM optimization is to maximize the FGM's value over its parameters $\{\Psi_i\}$ which are assumed to be independent. The argument $z_e$is regarded as fixed. The same weight function may be associated with several edges but will in general assume different values since $z_e$is a function of the edge to which it is attached.

If each edge has a distinct associated weight function then optimization is immediately reduced to the task of independently maximizing the weight functions in isolation. The interesting situation arises when this is not the case. Here a change to a single parameter set $\Psi$ may affect weights throughout the DAG. The values of many source-sink paths may be affected and an interaction between members of $\{\Psi\}$ arises, i.e. simple independent maximization is no longer possible.

Surprisingly it is still possible to decompose the optimization problem into certain independent subproblems. This is the key mathematical message of Baum-Welch and EM generalized to the FGM framework. However, even exact solution of the subproblems does not yield a maximum for the FGM. But a new combined parameter set does result that strictly improves it -- unless one has already converged to a limiting value. Iteration is then used to climb to such a limiting value. Unlike gradient descent with line search this approach confidently moves to a sequence of increasingly better points in parameter space and in many cases reduces to a strikingly simple and intuitive algorithm. The cost functions may themselves be FGMs and the computation of an improved parameter set simply depends upon the ability to solve the primitive maximization problems at the bottom of the recursive hierarchy.2

The mathematical idea behind this decomposition begins with a simple equality having to do with the $\log$of a sum of terms. Let $S = T_1
+ \ldots + T_n$and define proportions $P_i = T_i / S$. Notice $S =
T_i/P_i$. For any nonnegative vector $R$with unity sum:


$\displaystyle \log S$ $\textstyle =$ $\displaystyle \sum_i R_i \log S \rule{0pt}{38pt}$ (1)
  $\textstyle =$ $\displaystyle \sum_i R_i \log T_i/P_i$ (2)
  $\textstyle =$ $\displaystyle \sum_i R_i \log T_i - \sum_i R_i \log P_i$ (3)

Observe that the second term is maximized when $P=R$since both are stochastic vectors.

In our setting each term $T_i$corresponds to a source-sink path, and its value is not constant, but rather is a function of some parameter set ${\Psi^\prime}$. Optimizing the FGM is then the problem of maximizing their sum over ${\Psi^\prime}$. This is the same as optimizing their $\log$ sum; connecting the equality above to the problem of FGM optimization.

Accordingly we parameterize the equality by writing $S({\Psi^\prime}) =
T_1({\Psi^\prime}) + \ldots + T_n({\Psi^\prime})$and $P({\Psi^\prime}) = T_i({\Psi^\prime}) /
S({\Psi^\prime})$where our objective is to maximize $S$over ${\Psi^\prime}$. The search for a maximum takes place iteratively. During each iteration we denote the current parameters by $\Psi$and vary ${\Psi^\prime}$looking for an improved set.

To express this iterative approach the equality is specialized by defining $R$to be $P$evaluated at the current parameters $\Psi$, i.e. $R \triangleq P(\Psi)$. Then:


\begin{displaymath}
\log S({\Psi^\prime}) = \sum_i P_i(\Psi) \log T_i({\Psi^\prime})
- \sum_i P_i(\Psi) \log P_i({\Psi^\prime})
\end{displaymath}

The right term may be ignored because any departure from $\Psi$ will decrease it whereby $\log S$and therefore $S$are increased. We then focus on maximizing the left term which corresponds to the $Q$function of the HMM/EM literature.

Elements of the remaining summation correspond to source-sink paths through the FGM's DAG and the proportions $P_i(\Psi)$are the relative contribution of each path to the FGM's value. Each term $T_i({\Psi^\prime})$is then a product of of weight functions and breaks up into a sum of individual $\log$terms. The proportions are then distributed over them and the final step is to group terms by weight function. The problem is then decomposed as required into independent subproblems each involving a single weight function.

The argument above is made in much greater detail in section 0.2 and the formal definition of an FGM presented there is more involved in two respects. First, there are two weight functions, not one, associated with each edge. These are abstractions of the normally separate transition and observation generation functions of a stochastic automaton. Second, the $z_e$in our discussion above is defined to be some function of an observation tuple $x$. A simple such function is that of coordinate projection where, for example, the function's value might be that of a single dimension of $x$.

The contribution of section 0.2 is the development of a framework that has nothing directly to do with stochastic models or probability in which the essential ideas of Baum-Welch and EM nevertheless apply. Our view is that the essential content of these ideas is one of decomposability of certain optimization problems that are captured in a somewhat general way by our definition of an FGM.

The early literature, especially theorem 2.1 of [3], reveals an awareness that the mathematics of HMMs apply beyond strictly probabalistic settings. This direction of generalization does not however appear to have been followed until now. Thinking seems to have been dominated by the motivating problem: the optimization of Markovian stochastic models. Because of the limited computational capabilities of the time and their highly mathematical viewpoint a great deal of emphasis was also placed on the primitive weight functions for which exact maximization may be performed in closed form, and those which give rise to unique global maxima. By contrast we devote very little attention to the the specific characteristics of these primitive functions and view them instead as an abstract types that by assumption support a set of required operations.

In section 0.3 the FGM framework is specialized to the case of probabilistic models. The $z_i$above result from projections that select one or more coordinate dimensions such that along each source-sink path each dimension is selected exactly once. Also, the weight functions are restricted to be probability functions or densities. It is not important that the dimensions occur in any fixed order as one follows alternative source-sink routes; nor is it required that the dimensions be selected one at a time. The contribution of this outlook is that FGM-based stochastic models can generate observations in many orderings and groupings - not just one-by-one in sequential order as in HMMs - a capability exploited later in our discussion of stochastic transducers.

Section 0.4 begins by considering the case of infinite stochastic processes. Viewed generatively these models produce output of unbounded size but are used in practice to evaluate the probability of finite observations and are trained using a finite set of finite observations. Restricted in this way they correspond to FGMs that capture their truncated form. Moreover the conventional evaluation and learning algorithms associated with them arise naturally from this viewpoint.

Baum-Welch and EM are today regarded as maximum-likelihood parameter estimation techniques. A more modern learning theoretic perspective recognizes the important tension between model complexity and available training data. An important contribution of section 0.4 is the description of a construction that modifies the FGM associated with a stochastic model so that alternative criteria such as MDL or MAP can guide optimization. The optimization is still based on Baum-Welch/EM so we suggest that these should not be viewed as statistical ML techniques and that their broader utility is revealed by our more general functional optimization perspective.

The observation functions associated with each state in a conventional Hidden Markov model must satisfy an independence assumption: that their state dependency is limited to the current state (the Markov condition). Section 0.4 observes that conditioning on earlier portions of the observation sequence does not violate this assumption and demonstrates that using the FGM framework, it is easy to see that parameter reestimation for such context dependent HMMs decomposes into primitive context optimization problems. This generalizes Brown's use [7] of the previous speech window to condition generation of the current one. Thus the power of causal context models such as statistical language models may be added to the HMM framework and may lead to improved performance.

The section also presents another illustration that the reach of FGMs extends in a useful way beyond conventional and strictly stochastic models. Causal context models express the probability of an observation as a product of conditional probabilities such that each dimension is predicted once and does appear earlier as a conditioning variable. These restrictions are constrictive in settings such as image modeling where modeling each pixel based on its surrounding context seems natural. Simply relaxing the requirements that probabalistic modeling places on an FGM's source-sink projection functions proves that FGM optimization will improve even noncausal models like the image example above.

Stochastic models describe the manner in which an object is generated. This object is conventionally imagined to be something like an image, an audio recording, or a string of symbols. If instead one models pairs of such objects the focus shifts from describing the nature of an individual to uncovering the relationship between them. The pairs may be thought of as instances of the are similar concept and learning an effective model to describe them is an example of what the authors have called the metric learning paradigm.

Fixing the left element of the pair one might then ask what generative sequences could explain the right. This is the essential idea of $2$-way stochastic transduction. Section 0.5 shows that a somewhat general automaton-based definition for $k$-way transduction may be reduced to problems within the FGM framework. One may view speech recognition as a transduction, perhaps in several stages, and the application of learned transducers to this field represents an interesting area for future work.

Perhaps the best known example of $2$-way transduction is that of string edit distance which is defined as the least costly way to transform one string into another via a given set of weighted insertion, deletion, and substitution editing operations. The development in section 0.5.1 begins with a natural reformulation into stochastic terms first noted by Hall and Dowling in [12, P.390-1]. We refer to the result as stochastic edit distance [24].

Our focus is on the interesting problem of learning insert, delete, and substitute costs from a training corpus $(s_1,t_1), \ldots,
(s_n,t_n)$of string pairs. The objective is to minimize the total stochastic edit distance of the collection. In [24] the authors give such a learning algorithm independent of the FGM framework and report on experiments. Until now the costs which parameterize edit distance algorithms have been prescribed not learned. Section 0.5.1 of this paper describes the reduction of this learning problem to the FGM optimization framework and may be viewed as an adjunct to [24] providing a proof of correctness.

The optimized costs are often quite different from the initial ones. It is in this sense that we learn them. In previous work, the costs have been stipulated after study of the problem domain. Even worse, maximally uninformative unit costs (i.e., Levenshtein distance) are sometimes stipulated when the problem domain is nontrivial. It is now possible to improve any educated guess, or start virtually from scratch from unit costs, to arrive at a set of optimized ones. Algorithms 7 and 8 of section 0.5.1 run in time proportional to the product of the two string lengths, matching the complexity of the conventional edit distance computation. Since edit distance has found widespread application our discovery of a simple way to learn costs may lead to improved performance in many problem areas.

Section 0.5.1 also observes that the notion of string edit distance may be strengthened so that the cost of an edit operation depends not only on its target but also on context. For example, such context-sensitive string edit distances allow the cost of inserting a symbol to depend on the symbol left-adjacent to the insertion site. The result remains an FGM on one therefore the ability to easily improve costs given a training corpus is retained.

Several kinds of hidden discrete-state stochastic models are in widespread use today. Hidden Markov Models (HMM) were introduced in the 1960's [2,3,1] and are typically applied in time-series settings when it is reasonable to assume that the observed data are generated or approximated well by the output of an automaton which changes state and generates output values stochastically. A matrix-based approach is typically used to formalize this notion, and [21] and [14] provide nice overviews. Normal (Gaussian) mixtures find application in statistical pattern recognition where items to be classified are represented as real-valued vectors. Stochastic context free grammars (SCFG) lead to a natural probabilistic interpretation of natural language parsing. Other model types exist, and are easily formed by combining and modifying the standard ones.

The situation is complicated by parameter tying; the practice of reducing the number of parameters in a model by equating two or more different sets of formal parameters to a single set. Observations are sometimes discrete and sometimes continuous. Entire models may be mixed together or recursively combined. Some states or transitions may be nonemitting meaning that no observation is generated. Finally, nonuniform time-series models may predict several future data points. All of these factors make it difficult to write down a single high-level framework with which to reason about and verify the correctness of a model's associated algorithms.

The contribution of FGMs is to move instead to a lower-level representation which is common to all of the models and complicating variations above. As noted earlier isn't really the original model which is represented, but rather its truncation to the given collection of finite observations. By understanding FGMs it is possible to reason about the others as mappings from their individual description language to an FGM given finite observations.

Section 0.6 relates the models above to FGMs by describing a reduction for each. As further illustration it considers the portfolio optimization problem and shows that the simple iterative algorithm reported in [9] for optimization arises by reducing the problem to an FGM.

The work described in this paper began in 1993 with our efforts to model handwriting. The focus then was on transduction, in an attempt to evaluate and learn a notion for similarity for the off-line case. This effort exposed many limitations of the conventional modeling outlook - problems that extend far beyond handwriting. To address these limitations we had to break several rules of conventional modeling only to ultimately discover that these rules are in many cases unnecessary artifacts of the history that led to HMMs and EM. After three years of generalization, and exercises in specialization, we arrived at the FGM outlook reported here.


Finite Growth Models

In this section we develop the theory of Finite Growth Models (FGM) as a class of nonnegative functionals represented as annotated directed acyclic graphs. The next section focuses on the special case of probability models. Our formal definition of an FGM amounts to an intricate recursive data structure, in the computer science sense, that has been crafted to span a wide range of applications. As a result it may seem at first to be rather abstract and have little to do with stochastic modeling, but each feature is an abstraction of some characteristic required by a problem we considered. It is the mathematical engine at the center of these problems. In particular much of our development and notation parallels that of Hidden Markov Models (HMM).

Let $x$be a tuple observation with components $x_1,\ldots,x_d$, and ${\cal X}$denote the space of all such observations. We will not specify the type of each component since for much of our development it matters only that appropriate functions exists on the space of values that a component may assume. Components may therefore represent discrete values such as letters of the alphabet, continuous values such as real numbers, or other more complex structures. Within the observation tuple, component types need not be the same. In simple time series probability-model settings, components do have the same type, and position within the tuple corresponds to the passage of time.

Definition 1   A finite growth model $F$for a $d$-dimensional observation space ${\cal X}$, is an 8-tuple $(V,E,$$M,C,S,m,c,s)$, such that:

  1. $(V,E)$is a finite DAG with vertices $V=v_1,\ldots,v_n$having a single source $v_1$and sink $v_n$. For $e \in E$we denote by $s(e)$the index of its source vertex, and by $d(e)$the index of its destination vertex.

  2. $M$is a finite collection $\{M_i\}$of parameterized nonnegative bounded observation functionals with corresponding parameters $\{\Psi_i\}$. We write $M_i(y\vert\Psi_i)$to denote evaluation of the $i$th observation functional, where the domain is not yet specified.

  3. $C$is a finite collection $\{C_i\}$of discrete-domain parameterized nonnegative bounded choice functionals. The size of the domain of $C_i(j)$is denoted $\vert C_i\vert$, and the domain consists of the integers $1,\ldots,\vert C_i\vert$. The parameters of $C_i$are denoted $\Upsilon_i$and we write $C_i(j\vert\Upsilon_i)$to denote evaluation of the $i$th choice functional.

  4. $S$is a finite collection $\{S_i\}$of selection functions with domain $X$where the range is not yet specified.

  5. $m: E \rightarrow \{1,\ldots,\vert M\vert\}$is a function associating an element of $M$with each edge.

  6. $c: E \rightarrow \{1,\ldots,\vert C\vert\} \times {\mathbb{N}}$is a function that associates each edge with a choice function and a member of its range; which is formalized by denoting the projections of $c$by $c_1,c_2$and requiring:

    1. $s(e_1) = s(e_2) \Rightarrow c_1(e_1) =
c_1(e_2), \forall e_1,e_2 \in E$, so that an element of $C$is in effect associated with each nonsink vertex, and;
    2. For each nonsink vertex $v$, $c_2$ restricted to $\{e \vert s(e)=v\}$is 1-1 and onto the range of $C_{c_1(v)}$.

    The choice value corresponding to an edge $e$is then $C_{c_1(e)}(c_2(e)\vert\Upsilon_{c_1(e)})$and is denoted $C(e\vert\Upsilon)$for brevity where it is understood that in this context $\Upsilon$refers to $\Upsilon_{c_1(e)}$- but the subscript is sometimes written for clarity.

  7. $\varsigma: E \rightarrow \{1,\ldots,\vert S\vert\}$is an function that associates each edge with a selection function such that for all edges $e$, the range of $S_{\varsigma(e)}$matches the domain of $M_{m(e)}$. The observation value corresponding to an edge $e$is then $M_{m(e)}(S_{\varsigma(e)}(x)\vert\Psi_{m(e)})$and is denoted $M(e,x\vert\Psi)$for brevity where it is understood that in this context $\Psi$refers to $\Psi_{m(e)}$- but the subscript is sometimes written for clarity.

The combined observation and choice parameters are denoted $\Psi$and $\Upsilon$respectively, and $\Phi
\triangleq (\Psi,\Upsilon)$denotes the parameters of $F$. The value of a source-sink path is the product of the observation and choice functional values along it. The value $F(x\vert\Phi)$of the FGM is the sum over all source-sink paths, of the value of each. A choice or observation model whose value is the constant $1$does not affect an FGM's value and is therefore said to be void.

Notice that since an FGM is itself a nonnegative parameterized functional, the definition above may be applied recursively to define an FGM whose constituent functionals are themselves FGMs.

Direct evaluation of an FGM's value by enumerating all source-sink paths is of course not practical. Fortunately $F(x\vert\Phi)$may be efficiently computed using dynamic programming. To express this precisely requires some notation. Let $v_1,\ldots,v_n$be a topologically sorted vertex listing. Following the convention of the HMM literature, we define corresponding real variables $\alpha_1,\ldots,\alpha_n$. The sets of edges into and out from a vertex $v$are denoted by ${\cal I}(v)$and ${\cal O}(v)$ respectively. Using the notation above, the contribution of an edge $e$to the value of a path containing it, is $C(e\vert\Upsilon) \cdot
M(e,x\vert\Psi)$The product of these contributions is the probability of the path. The complete dynamic program is then given by the following simple algorithm:

Algorithm 1   ` =13`^M=13 procedure $\mbox{\it forward}$ $\alpha_1 = 1$ for $i = 2,\ldots,n$ $\alpha_i = \sum_{e \in {\cal I}(v_i)}
\alpha_{s(e)} \cdot
C(e\vert\Upsilon) \cdot
M(e,x\vert\Psi)$

This may be recognized as the forward step of the Baum-Welch HMM forward-backward algorithm [2,1,21] adapted to the more general FGM setting. Following execution $\alpha_n$has value $F(x\vert\Phi)$.

Algorithm 1 is said to evaluate the FGM using its current set of parameters. The optimization problem we consider seeks a parameter set that maximizes the FGM's value


\begin{displaymath}
\overline{\Phi}= \mathrm{arg max}_\Phi F(x\vert\Phi)
\end{displaymath} (4)

but we will be content to find locally optimal solutions, and in some cases to simply make progress. Here $x$is fixed and it is the FGM's parameters that are varied.

The end-to-end concatenation of two FGMs clearly results in another whose value is the product of the two, so that given a set of observations $x_1,\ldots,x_m$, the optimization problem


\begin{displaymath}
\overline{\Phi}= \mathrm{arg max}_\Phi \prod_{i=1}^m F(x_i\vert\Phi)
\end{displaymath} (5)

is really just an instance of Eq. 4. Viewing $\{x_i\}$as training examples the optimization then learns a parameter set, or in the language of statistics and assuming the FGM corresponds to a probability model, estimates its parameters.

Our approach to this problem extends the idea of Baum-Welch to FGMs. Several years following the development of HMMs, essentially the same mathematical ideas expressed in the their algorithm were rediscovered as the expectation maximization (EM) algorithm for mixture densities [11,22]. This work focused on parameter estimation for mixture densities, and has had considerable influence on a somewhat independent community. In what follows we will use the phrase Baum-Welch/EM to refer to the adaptation of these results to the FGM setting.

The HMM and EM literature deals with probabalistic models, and Baum-Welch/EM is presented as a method for reestimating their parameters. Our mathematical contribution is the recognition that the method is not essentially a statement about probability but rather should be viewed as an iterative approach to the maximization of arbitrary parameterized nonnegative functionals of a certain form. The form consists of a sum of terms each one of which is a product of subterms. An FGM's DAG is a representation of these forms which, for many with appropriate structure, has the advantages of succinct representation and computational economy. In the interesting case, the subterms are partitioned or grouped into families sharing common parameters. The approach then consists of focusing on each family in isolation thus localizing the problem. Individual family maximization problems are constructed that include certain weighting factors. These factors are constant with respect to the family's optimization but are computed using the entire DAG and are therefore influenced by all families. Thus global information is used to decompose the original problem - localizing it to individual families. If parameters are found that improve any of the individual family problems, the FGM's value will strictly increase. The method is iterative since even exact maximization of every family's problem may not maximize the FGM's value. Instead the process is repeated.

To formalize this idea of simply making progress with respect to a given maximization problem, the notation:


\begin{displaymath}
\mathrm{arg climb}_{a\Vert b} f(a) \triangleq \{ a \vert f(a) \geq f(b) \}
\end{displaymath}

is introduced. Notice that the improvement need not be strict and that the result is a set of values.

Our definition of an FGM is clearly an abstraction of stochastic modeling. We remark that it is possible to start from a definition that is yet more general and in a sense simpler in which only choice models are used and a variable number may be attached to each edge. But then connecting our results to the mainly stochastic problems we are interested in becomes more complex, and it is more difficult to see the connection between our generalization and the original literature.

We therefore develop Baum-Welch/EM for FGMs starting from a lemma that is stated so as to illustrate its correspondence with the EM literature. It is a statement about improper mixture densities, i.e. where the whole and its constituent parts are not necessarily probabilities. This is relevant to FGMs because they may be regarded as immense mixtures of contributions from every source-sink path. We follow earlier lines in its proof without requiring proper densities. The lemma may also be derived starting from theorem 2.1 of [3] in its most general form.

Lemma 1 (Baum-Welch/EM)   Let $f(x\vert\Phi) \triangleq \sum_{i=1}^k g(x\vert\omega_i,\Psi)
h(\omega_i\vert\Upsilon)$be a finite parameterized mixture where $\{g(\cdot\vert\omega_i)\}$and $h$are nonnegative parameterized functionals (but not necessarily probability functions or pdfs), $\omega_i$selects a component of the mixture, and $\Phi \triangleq \{\Psi, \Upsilon\}$. Also define probability $P(\omega_i\vert x,\Phi) \triangleq g(x\vert\omega_i,\Psi)
h(\omega_i\vert\Upsilon) / f(x\vert\Phi)$. Finally, assume $f(x\vert\Phi)
> 0$. Then:


\begin{displaymath}
\overline{\Phi}\in \mathrm{arg climb}_{\Psi^\prime, \Upsilo...
...ga_i\vert x,\Phi)\log h(\omega_i\vert\Upsilon^\prime)
\right)
\end{displaymath}

is an improved parameter set. That is, $f(x\vert\overline{\Phi}) \geq f(x\vert\Phi)$, with the inequality strict unless $\Phi$is already optimal. It is understood that $\Psi^\prime, \Upsilon^\prime$vary over their defined domain.

proof: We begin by establishing the identity:

\begin{eqnarray*}
E_{\omega\vert x,\Phi}(\log f(x\vert{\Phi^\prime})) & = &
\...
... P(\omega_i\vert x,\Phi) \\
& = & \log f(x\vert{\Phi^\prime})
\end{eqnarray*}



where $\omega$is viewed as a discrete random variable distributed as $P(\omega_i\vert x,\Phi)$and $E_{\omega\vert x,\Phi}$denotes expectation with respect to it. The point is merely that $f(x\vert{\Phi^\prime})$is independent of $\omega$since the former depends on ${\Phi^\prime}$not $\Phi$.

Next with $f(x,\omega_i\vert{\Phi^\prime}) \triangleq g(x\vert\omega_i,{\Psi^\prime})
h(\omega_i\vert{\Upsilon^\prime})$we have $\log f(x\vert{\Phi^\prime}) = \log f(x,\omega\vert{\Phi^\prime})
- \log P(\omega\vert x,{\Phi^\prime})$since $f(x,\omega\vert{\Phi^\prime}) =
P(\omega\vert x,{\Phi^\prime}) f(x\vert{\Phi^\prime})$. So:

\begin{eqnarray*}
\log f(x\vert{\Phi^\prime}) & = & E_{\omega\vert x,\Phi} \log...
...omega_i\vert x,\Phi)
\log P(\omega_i\vert x,{\Phi^\prime}) \\
\end{eqnarray*}



It follows from Jensen's inequality that the second summation is minimized by $\Phi = {\Phi^\prime}$. This can also be seen by recognizing it as $-[D(\Phi\Vert{\Phi^\prime}) + H(\Phi)]$, where $D(\cdot\Vert\cdot)$is the Kullbak-Leibler distance, and $H(\cdot)$denotes entropy (see [10]). Thus maximizing the first term, written $Q(\Phi,{\Phi^\prime})$in the literature, will surely increase $\log f(x\vert{\Phi^\prime})$and hence $f(x\vert{\Phi^\prime})$. But:


\begin{displaymath}
\sum_{i=1}^k P(\omega_i\vert x,\Phi)
\log f(x,\omega_i\ver...
...(\omega_i\vert x,\Phi)
\log h(\omega_i\vert\Upsilon^\prime)
\end{displaymath}

and the right side is recognized as the object of the $\mathrm{arg climb}$operator in the lemma's statement.$\;\Box$

The conditional expectation operator $E_{\omega\vert x,\Phi}$can cause confusion but we use it to make clear the correspondence with the EM literature. Our discussion in section 0.1 starting with Eq. 1 represents an alternative approach that is free of explicit mention of expectation and probability.

To apply lemma 1 to FGM optimization we introduce two more simple algorithms. The first computes $F(x\vert\Phi)$by sweeping through the DAG's vertices in reverse rather than forward order, establishing the values of a new set of real variables $\beta_1,\ldots,\beta_n$:

Algorithm 2   ` =13`^M=13 procedure $\mbox{\it backward}$ $\beta_n = 1$ for $i = n-1,\ldots,1$ $\beta_i = \sum_{e \in {\cal O}(v_i)}
\beta_{d(e)} \cdot
C(e\vert\Upsilon) \cdot
M(e,x\vert\Psi)$

This corresponds to the backward step of the forward-backward algorithm. After it has run, $\beta_1$equals $F(x\vert\Phi)$, and therefore $\alpha_n$.

We denote by $\omega_1,\ldots,\omega_N$the source-sink paths through $F$, and by $E(\omega)$the set of edges along a particular path $\omega$, and by $\Omega(e)$the set of paths that include edge $e$. The contribution of a single path $\omega$to the overall value of the FGM is denoted $F(x,\omega\vert\Phi)$and given by:


\begin{displaymath}
F(x,\omega\vert\Phi) = \prod_{e \in E(\omega)}
C(e\vert\Upsilon) \cdot M(e,x\vert\Psi)
\end{displaymath}

So that:


\begin{displaymath}
F(x\vert\Phi) = \sum_{i=1}^N F(x,\omega_i\vert\Phi)
\end{displaymath}

Next define $\gamma_\omega \triangleq F(x,\omega\vert\Phi) / F(x\vert\Phi)$, and then for each edge $e$define:


\begin{displaymath}
\gamma_e \triangleq \sum_{\omega \in \Omega(e)} \gamma_\omega
\end{displaymath}

Notice that by definition the $\gamma_e$arise from a partition of the set of all source-sink paths whence it is clear that:

Proposition 1   The sum of the $\gamma_e$for any cut of the DAG is unity.

The $\alpha$and $\beta$variables may be used to compute these $\gamma_e$ via a third simple dynamic program corresponding to the expectation step of EM:

Algorithm 3   ` =13`^M=13 procedure $\mbox{\it gamma}$ for $i = n-1,\ldots,1$ for $e \in {\cal O}(v_i)$ $\gamma_e =
(\alpha_{s(e)} \cdot
C(e\vert\Upsilon) \cdot
M(e,x\vert\Psi) \cdot
\beta_{d(e)}) / F(x\vert\Phi)$

where either $\alpha_n$or $\beta_1$may be substituted for $F(x\vert\Phi)$. As a practical matter algorithms 3 and 2 may be combined.

We are now in position to state precisely how the objective of optimizing an FGM is decomposed into smaller problems.

Definition 2   An inexact Baum-Welch/EM reestimate $\overline{\Phi}=
(\overline{\Psi},\overline{\Upsilon})$of the current parameters $\Phi=(\Psi,\Upsilon)$of an FGM $F$given an observation $x$is formed by solving two sets of independent optimization problems:

  1. For each $i$the improved parameters $\overline{\Psi}_i$of observation model $M_i$are given by:


    \begin{displaymath}
\overline{\Psi}_i \in \mathrm{arg climb}_{{\Psi^\prime}_i\V...
...in m^{-1}(i)} \gamma_e \cdot
\log M(e,x\vert{\Psi^\prime}_i)
\end{displaymath}

  2. For each $i$the improved parameters $\overline{\Upsilon}_i$of choice model $C_i$are given by:


    \begin{displaymath}
\overline{\Upsilon}_i \in \mathrm{arg climb}_{{\Upsilon^\pr...
..._1^{-1}(i)} \gamma_e \cdot
\log C(e\vert{\Upsilon^\prime}_i)
\end{displaymath}

where the $\gamma_e$are computed based on $\Phi=(\Psi,\Upsilon)$. The exact Baum-Welch/EM reestimate is formed by replacing the $\mathrm{arg climb}$operations with $\mathrm{arg max}$. By $m^{-1}(\cdot)$and $c^{-1}(\cdot)$we mean the set valued inverses of functions $m$and $c$.

The exact form corresponds to the historical definition of the Baum-Welch/EM reestimate. However, since this paper's focus is on decomposition, and not mathematical issues relating to particular choice and observation models, we will use ``Baum-Welch reestimate'' to refer to our inexact form defined above. This allows our framework to include models for which exact solution is not convenient. Also it is interesting to note that one may choose to not reestimate one or more models, and still improve the overall FGM.

Theorem 1   The Baum-Welch/EM reestimate $\overline{\Phi}$satisfies $F(x\vert\overline{\Phi})
\geq F(x\vert\Phi)$with the inequality strict provided that progress is made in at least one subsidiary optimization problem.

proof: We write:


\begin{displaymath}
F(x\vert\Phi) = \sum_{i=1}^N \underbrace{ \left(
\prod_{e ...
...i)} C(e\vert\Upsilon_i)
\right) }_{h(\omega_i\vert\Upsilon)}
\end{displaymath}

which makes clear the correspondence between the FGM and lemma 1. Now by their definitions $\gamma_{\omega_i} = P(\omega_i\vert x,\Phi)$so from the lemma we have:


\begin{displaymath}
\overline{\Phi}\in \mathrm{arg max}_{{\Phi^\prime}, {\Upsil...
...\in E(\omega_i)} C(e\vert{\Upsilon^\prime}_i) \right) \right]
\end{displaymath}


\begin{displaymath}
\overline{\Phi}\in \mathrm{arg max}_{{\Phi^\prime}, {\Upsil...
... \gamma_{\omega_i} \log C(e\vert{\Upsilon^\prime}_i)
\right]
\end{displaymath}

The double summations above enumerate every edge along every path though the FGM. To complete the proof we have only to enumerate them in a different order:


\begin{displaymath}
\overline{\Phi}\in \mathrm{arg max}_{{\Phi^\prime}, {\Upsil...
...(e)} \gamma_\omega \log C(e\vert{\Upsilon^\prime}_i)
\right]
\end{displaymath}

Recall $\gamma_e = \sum_{\omega \in \Omega_e}
\gamma_\omega$so:


\begin{displaymath}
\overline{\Phi}\in \mathrm{arg max}_{{\Phi^\prime}, {\Upsil...
... \gamma_e \log C(e\vert{\Upsilon^\prime}_i)
\right)
\right]
\end{displaymath}

and the parenthesized terms are the independent subproblems of definition 2.$\;\Box$

We now change our focus to computational and data structural issues. The first part of our discussion deals with the intricacies of recursion. That is, when an FGM invokes another FGM as one of its observation models. Here, proper algorithm and data structure design can lead to polynomial rather than exponential complexity as illustrated by the case of stochastic context free grammars described in section 0.6.2. We will confine ourselves to finite recursion. The second issue has to do with the arithmetic demands of FGM computation.

The idea behind reducing the complexity of recursive FGM operations is a simple one: avoid duplicate work. If a particular combination of an observation model and selection function occurs multiple times, the combination should only be evaluated once. Since choice models do not depend on the observation, each should be evaluated only once. The computation is then ordered so that when an edge is considered, its choice and observation functions have already been evaluated and their values may be looked up in constant time.

To formalize this idea recall that $M(e,x\vert\Psi)$is defined as $M_{m(e)}(S_{\varsigma(e)}(x)\vert\Psi_{m(e)})$. Notice that edge $e$ appears as a function argument. We say $M(e_1,x\vert\Psi) \equiv
M(e_2,x\vert\Psi)$if their definitions are the same. Each equivalence class is referred to as a model application and expressing algorithms in terms of these classes avoids duplicate work. If for two edges $\varsigma(e_1) \neq \varsigma(e_2)$then by this definition $M(e_1,x\vert\Psi) \not\equiv M(e_2,x\vert\Psi)$- even if $S_{\varsigma(e_1)}$is the same function as $S_{\varsigma(e_2)}$. For this reason we adjust the definition to account for equivalences that may exist among the $\{S_i\}$.

This adjustment is subtle and we therefore illustrate it by example. Suppose that the observation model associated with an edge $e_i$of the top level FGM is itself an FGM $F_a$and that selection function $S_q$is used by $e_i$. Now within $F_a$assume that some edge $e_j$ uses an observation model $m$with selection function $S_r$. Then to evaluate $m$, selection functions $S_q$and $S_r$are composed. Now focus on another top level edge $e_k$to which an FGM $F_b$is attached using selection function $S_t$. Within $F_b$consider an edge $e_\ell$using the same observation model $m$as does $e_j$, but in combination with selection function $S_v$. To evaluate $m$here $S_t$ and $S_v$are composed. So $m$should only be evaluated once if $S_r(S_q(\cdot))$and $S_v(S_t(\cdot))$are the same function. This may be the case despite the fact that $S_r$and $S_v$are not. Thus, whenever possible, equivalence classes should be defined so as to account for possibly equivalent compositions of selection functions. This is easily implemented for the class of projection selection functions introduced in the next section.

Mathematically definition 1 states that each edge has an attached observation model via mapping $m$. Computationally the edge instead selects (in practice points to) a model application structure which identifies the observation model and the corresponding selection function. We denote the set of observation model applications by $\{{\cal M}_i\}$. Inclusion within the recursive hierarchy induces a partial ordering on this set where it is important to realize that a given model application may be referenced at more than one recursive level. We extend this ordering by specifying that all primitive observation models are dominated by any FGM. In what follows we will then assume that $\{{\cal M}_i\}$is indexed in topological order so that the top-level FGM is first and the primitive models form the list's end.

We also introduce choice model application structures $\{{\cal C}_i\}$. Since choice models do not depend on $x$or involve a selection function, these structures are in 1:1 correspondence with the choice models themselves and are identically indexed. Many references may exist to a choice model, and its corresponding application structure avoids redundant model evaluation by recording the model's current value. It also accumulates $\gamma$contributions during expectation step processing so that a single call may be made to maximize the model. This is explained later in greater detail.

If an FGM $F_c$is a recursive child of a parent $F_p$, one alternative is to eliminate the recursion and expand the child inline within the parent by adding vertices and edges to its DAG. If $F_c$is invoked along edge $e$between vertices $i$and $j$then the first step of this inline expansion disconnects $e$from $j$and replaces the invocation of $F_c$with a void model. The child $F_c$is then inserted between the disconnected end and vertex $j$.

When evaluating $F_p$this inline expansion clearly isn't necessary since one can evaluate $F_c$in isolation, record its value in the model application structure, and in constant time refer to that value where needed in $F_p$. The same is true for parameter reestimation although a brief argument is needed.

Proposition 2   Suppose edge $e$of FGM $F_p$invokes model $F_c$, also an FGM. Then if $F_c$is expanded inline within $F_p$, the $\gamma$values of its edges are exactly those computed in isolation multiplied by $\gamma_e$.

proof: Let $e$connect vertices $i$and $j$and let $\omega$denote some source-sink path of $F_c$and $\gamma_\omega$its weight. If $F_c$is expanded inline the value of $\omega$will be $\alpha_i C(e\vert\Upsilon) \gamma_\omega \beta_i / F_p(x\vert\Phi)$. In isolation it is $\gamma_\omega / F_c(x\vert\Phi)$. But $\gamma_e = \alpha_i C(e\vert\Upsilon) F_c(x\vert\Phi) \beta_i / F_p(x\vert\Phi)$ from which the proposition follows.$\;\Box$

So if the same model application ${\cal M}_i$is pointed to by many FGM edges, the right computational approach is to first accumulate their $\gamma$values and then to process ${\cal M}_i$. Here we emphasize that this approach is more efficient and mathematically identical to inline expansion. So if at the bottom of the recursion the required $\mathrm{arg max}$problems are solved, then at all higher levels they are too.

An observation model $M_i$that is not an FGM is said to be primitive, and is represented by an abstract type for which the following operations are defined:

  1. ${M}_{i}\!:\!\mbox{\it evaluate}(z)$
  2. ${M}_{i}\!:\!\mbox{\it Estep}(z, \Gamma)$
  3. ${M}_{i}\!:\!\mbox{\it Mstep}()$

Here $z$is an element of the range of a selection function and $\Gamma$is a non-negative weight propagated down as described in proposition 2. The evaluate function returns the value of $M_i$at $z$. Notice that the model's parameters $\Psi_i$are hidden by the abstract type. In practice a mutator may be provided to set them and a selector provided for readout. The names Estep and Mstep follow the EM literature even though FGMs are not restricted to maximum-likelihood estimation of probability functions and densities.

Given a sequence of values $z_1,\ldots,z_m$, and nonnegative multiplicities $\Gamma_1,\ldots,\Gamma_m$the Estep and Mstep procedures address the following optimization problem:


\begin{displaymath}
\overline{\Psi}_i = \mathrm{arg max}_{{\Psi^\prime}_i} \prod_{j=1}^m M_i^{\Gamma_j}(z_j\vert{\Psi^\prime}_i)
\end{displaymath} (6)

which generalizes Eq. 5. The Estep procedure is called for each $(z_i,\Gamma_i)$and then Mstep is called to reestimate the model's parameters. For some models the result is the unique optimum parameter set, but in general the process is repeated. For example, if the $z_i$are letters of the alphabet, the $\Gamma_i$are positive frequencies, and the model's value is a probability for each letter, then Estep need only record the $\Gamma$values provided and Mstep can easily determine the unique optimal model. That is:


\begin{displaymath}
P(i) = \frac{\Gamma_i}{\sum_j \Gamma_j}
\end{displaymath}

We point out that if the probability of symbol $i$is zero prior to reestimation, then $\gamma_i$and therefore its reestimate $P(i)$are zero. Such parameters in this simple discrete model are then effectively disabled since their corresponding choice of alphabet symbol will always have probability zero.

A simple unique solution also exists for the normal density and others. In the discrete example above the model is constrained to produce values which sum to one over the alphabet. Normal densities are constrained to integrate to one. These constraints are obviously important so that the overall optimization problem has a bounded solution. But it is important to observe that their precise nature is irrelevant and invisible to the FGM framework. In particular there is no need for a model to represent a probability function There is also no need to restrict one's attention to models for which unique optimum solutions are readily available. As noted earlier, any progress made for even one model, will translate to progress in the overall FGM optimization.

A choice $C_i$is represented by an abstract type for which the following operations are defined:

  1. ${C}_{i}\!:\!\mbox{\it evaluate}(j)$
  2. ${C}_{i}\!:\!\mbox{\it Mstep}(\Gamma[\ldots])$

Here $j$is the index of an edge and the evaluate function returns its value under $C_i$given the model's current parameters $\Upsilon_i$. The Mstep procedure is passed the vector $(\Gamma_1,\ldots,\Gamma_{\vert C_i\vert})$and addresses the following optimization problem:


\begin{displaymath}
\overline{\Upsilon}_i = \mathrm{arg max}_{{\Upsilon^\prime}...
...1}^{\vert C_i\vert} C_i^{\Gamma_j}(j\vert{\Upsilon^\prime}_i)
\end{displaymath} (7)

If $C_i$is a probability function then the problem is exactly that of Eq. 6 and an exact solution is available. In both cases the solution is mathematically optimal but other problem constraints might be incorporated into $C_i$or $M_i$ that sometimes forbid it. For example, one might prohibit probabilities from falling below some fixed threshold. This amounts to a change in parameterization and we will see in a later section that this capability may be used to alter the meaning of optimal in useful ways. Note that we might instead have treated a choice model as an observation model over the natural numbers but decided that their roles are sufficiently distinct to warrant separate designs.

We now return to our discussion of model applications and begin by describing the information that must be associated with these structures. An observation application structure ${\cal M}_i$contains:

  1. ${\cal M}_{i}\!:\!\mbox{\it model}$-- the integer index of the observation model associated with this application.

  2. ${\cal M}_{i}\!:\!\mbox{\it selector}$-- the integer index of associated selector function.

  3. ${\cal M}_{i}\!:\!\mbox{\it value}$-- a real number representing the most recent model application evaluation.

  4. ${\cal M}_{i}\!:\!\mbox{\it$\Gamma$}$-- an accumulator to which values are added by the possibly many places within the overall model that refer to ${\cal M}_i$. In the case of primitive models we may then call Estep a single time. For FGMs the value corresponds to $\gamma_e$of proposition 2 and is used to propagate $\gamma$values down the recursive hierarchy.

A choice application structure ${\cal C}_i$contains:

  1. ${\cal C}_{i}\!:\!\mbox{\it value}[\ldots]$-- a vector of real numbers representing the choice model's value at each integer of its domain based on its current parameters.

  2. ${\cal C}_{i}\!:\!\mbox{\it$\Gamma$}[\ldots]$-- a vector accumulator to which values are added by the possibly many places within the overall model that refer to ${\cal C}_i$.

No model variable is needed since indexing corresponds to that of the choice models themselves.

Recall that the mathematical definition 1 of an FGM associates edges directly with elements of $M,C,S$via functions $m,c,s$- there are no model application structures. While the FGM along with its associated model application structure lists might be generated incrementally, it is simpler to imagine that the FGM is first specified mathematically and then compiled into a form that includes these lists. This compiled form is denoted $\hat{F}$and merges the $M,C,S$sets of each FGM into single densely enumerated sets. In compiled form FGM edges do not point directly to observation models but rather to observation model application structures. The $m$ function is replaced by $m_a$to effect this change.

We present evaluation, expectation step, and maximization step algorithms for compiled FGMs. These are not recursive functions but rather operate by traversing the global model application lists built during compilation.

Before any of these are called the FGM must be initialized. This procedure $\mbox{\it initialize}(\hat{F})$consists of setting the $\Gamma$variables to zero in all choice model application structures and then evaluating each choice model and recording its value there.

Evaluating an FGM and performing an expectation step are carried out by bottom-up (reverse order) and top-down (in order) traversals respectively of the model application lists. Evaluation returns a single value for $\hat{F}$given observation $x$and also records the value of all subsidiary models within the observation model application structures.

Algorithm 4   ` =13`^M=13 function $\mbox{\it evaluate}(\hat{F}, x)$ for each ${\cal M}_i$in descending index order if ``primitive'' $j = {\cal M}_{i}\!:\!\mbox{\it selector}$ $k = {\cal M}_{i}\!:\!\mbox{\it model}$ ${\cal M}_{i}\!:\!\mbox{\it value} = {M}_{k}\!:\!\mbox{\it evaluate}(S_j(x))$ else ``an FGM'' ${\cal M}_{i}\!:\!\mbox{\it value} = $result of algorithm 1 return ${\cal M}_{1}\!:\!\mbox{\it value}$

where algorithm 1 refers to the appropriate observation and choice application structure value variables rather than computing $C(e\vert\Upsilon)$and $M(e,x\vert\Psi)$.

As for primitive observation models the FGM expectation step procedure accepts a $\Gamma$argument. Supplying a value of $1$for each $x$ observed corresponds to the optimization problem of Eq. 5. In general these $\Gamma$values generalize the problem by appearing as exponents to the FGM's value as in % latex2html id marker 2078
$Eq.~\ref{improvemanymult}$. The FGM is first evaluated so that the observation model application structures contain current values. Their $\Gamma$values are then set to zero except for the first which corresponds to the top-level FGM and is set to the function's $\Gamma$argument. The main loop proceeds top-down to propagate $\Gamma$contributions to lower observation model application structures. Eventually primitive structures are reached and a primitive expectation step is performed. The FGM's value is returned as a convenience since it was necessary to compute it.

Algorithm 5   ` =13`^M=13 function $\mbox{\it Estep}(\hat{F}, x, \Gamma)$ $\mbox{\it {evaluate}}(\hat{F},x)$ for each ${\cal M}_i$ if $i = 1$, ${\cal M}_{i}\!:\!\mbox{\it$\Gamma$} = \Gamma$, else ${\cal M}_{i}\!:\!\mbox{\it$\Gamma$}= 0$ for each ${\cal M}_i$in increasing index order if ``primitive'' $j = {\cal M}_{i}\!:\!\mbox{\it selector}$ $k = {\cal M}_{i}\!:\!\mbox{\it model}$ ${M}_{k}\!:\!\mbox{\it estep}(S_j(x), {\cal M}_{i}\!:\!\mbox{\it$\Gamma$})$ else ``an FGM'' Set $\{\gamma_e\}$using algorithms 1,2,3 for each edge $e$ ${\cal M}_{m_a(e)}\!:\!\mbox{\it$\Gamma$} \,+\!=\gamma_e \cdot {\cal M}_{i}\!:\!\mbox{\it$\Gamma$}$ ${\cal C}_{c_1(e)}\!:\!\mbox{\it$\Gamma$}[c_2(e)]\,+\!=\gamma_e\cdot{\cal M}_{i}\!:\!\mbox{\it$\Gamma$}$ return ${\cal M}_{1}\!:\!\mbox{\it value}$

The FGM maximization procedure maximizes the primitive observation models and all choice models. The FGM is then reinitialized.

Algorithm 6   ` =13`^M=13 function $\mbox{\it Mstep}(\hat{F})$ for each $M_i$ ${M}_{i}\!:\!\mbox{\it maximize}()$ for each $C_i$ ${C}_{i}\!:\!\mbox{\it maximize}({\cal C}_{i}\!:\!\mbox{\it$\Gamma[\ldots]$})$ $\mbox{\it {initialize}}(\hat{F})$

Observe that we do not specify that the models are maximized in any particular order since if their parameter sets are disjoint so is the act of maximizing them. Our notation thus-far has suggested that the parameter sets are in fact independent but in general they need not be. For example, two models might share the same parameters but compute two different functionals depending on them. In this case the two corresponding optimization subproblems of definition 2 are simply combined into one. From an implementation viewpoint this kind of situation can be handled entirely at the model type level through communication between them that is invisible at higher levels. So the fact that algorithm 6 specifies no order for maximization does not imply that we are limited to the case of disjoint parameter sets.

Analysis of the algorithms above is straightforward from the structure of the compiled FGM $\hat{F}$. Let $N_E$denote the sum over nonprimitive observation model applications, of the number of edges in the FGM corresponding to each. Then both evaluation and expectation step processing require $O(N_E) + T_p$time where $T_p$denotes the time needed to perform the required primitive observation model type operations. These are described by the final entries in the list of observation model applications. When these primitive operations are $O(1)$time the overall complexity is just $O(N_E)$since their number cannot exceed $N_E$. Each choice model is evaluated once during initialize. The maximize operation time-complexity is just that required to perform a maximization step for each observation and choice model in the system. Ignoring the internal requirements of the observation and choice models, it is obvious that $\hat{F}$requires $O(N_E)$space.

The algorithms above are conceptually straightforward, but an important numerical issue must be faced in their implementation. As one sweeps through the DAG, the $\alpha$and $\beta$variables can easily become too large or small to represent as conventional floating point values. This is true even of the final FGM value. If the FGM is a probability model, exponent-range underflow is the issue and corresponds to an extremely small probability of generating any particular observation. For complex problems this behavior is the rule not the exception since element probabilities in general decline exponentially with dimension. Logarithmic representation is one solution but the authors have also explored a floating point representation with extended exponent range as reported in [23]. Because $\gamma$variables are normalized by $F(x)$they may be adequately represented using standard floating point.

We now briefly discuss the computational complexity of FGM maximization. A restricted class of FGMs is identified for which maximization is shown to be NP-complete. Members of this class have constant choice functions that assume value $1$. There are two selection functions $s_0$and $s_1$that assume constant values $0$ and $1$respectively. Notice that there is no dependency on any observation. Each observation model has a single parameter bit $b$ and given boolean argument $x$assumes value $(b \wedge \bar{x}) \vee
(\bar{b} \wedge x)$. The FGM then assumes a bounded integral value and the maximization problem is well-posed. The corresponding decision problem is: does there exist a set of parameter bits such that the FGM's value is greater than a given integer $i$? This problem is clearly in NP since given a set of parameter bits, FGM evaluation, which runs in linear time, may be used to answer the question. We will refer to this setting as the LFGM (logical FGM) problem.

Theorem 2   LFGM is NP-complete.

proof: The NP-complete SAT problem is easily reduced to LFGM by first associating an observation model with each variable. If the variable $v,\bar{v}$occurs in a clause then selector $s_1$or $s_0$is used respectively. Clauses gives rise to trivial FGMs in which each term corresponds to a distinct edge from source to sink with the appropriate observation model and selection function attached. The FGMs for each clause are then concatenated to form the final FGM. The set of clauses are then satisfied if and only if the FGM assumes a value greater than zero. The reduction's complexity is linear whence LFGM is NP-complete.$\;\Box$

It is possible to construct reductions like that above where the observation models are continuous not discrete functions; but further discussion of the complexity of continuous optimization is beyond the scope of this paper. These continuous formulations correspond to stochastic continuous embedding approaches to SAT and represent an interesting area for future work.

This concludes our development of FGMs as general mixture-forms and in closing we remark that our framework is certainly open to additional generalization and refinement. For example one might introduce choice models that also depend on the observation, or replace the $m,c,s$ edge association functions with distributions. Many such variations are possible but our goal was a development somewhere between empty generalization and over specialization.


FGM-based Probability Models

In this section we consider specializations of Finite Growth Models such that their value represents a probability. The associated DAG and attached models may then be viewed as an acyclic automaton directing the stochastic generation of objects. Stochastic modeling is the original motivation for FGMs and much of our nomenclature is drawn from this viewpoint.

The first step is to specialize the selection functions of definition 1 to projections. That is, functions that select some subset of the dimensions of tuple observation $x$. A projection is characterized by a subset $g$of the dimension indices of ${\cal X}$. We associate some subset $g_e$with every edge $e$of the FGM and then denote by $\pi_{g_e}(x)$, the projection of observation tuple $x$corresponding to selection of the components in $g_e$. This then is the selection function associated with $e$. For some edge $e$ suppose $g_e = \{2,4\}$. Then $\pi_{g_e}(x)$is a tuple with two components which are equal in value to the second and fourth components of $x$. We assume for now that $g_e \neq \emptyset$but will later see that this is unnecessary. A composition of projections, applied to $x$, is again a projection characterized by some dimensional subset; whence it is straightforward to define equivalence for the purpose of identifying truly distinct model applications for recursively combined FGMs. The use of projections corresponds to the generation of observations in arbitrary groupings and orderings - a fact we will rely on later in our discussion of stochastic transduction.

Next we assume that all choice and observation models are probability functions or densities. Each $C_i$then corresponds to a stochastic choice among edges to follow. Each $M_i$corresponds to stochastic generation of some portion of $x$(since our selectors are projections).

Finally we specialize the $g_e$subsets so that along any source-sink path their intersection is void and their union is the complete set of dimensions. Each path then represents one way in which $x$might be stochastically generated or grown and corresponds to a permutation of the dimensional indices $1,\ldots,d$. The ``finite growth model'' framework is so named because we imagine that complete observations are grown, starting from the empty set, in many possible orderings, as specified by the model's underlying graph. The value of each source-sink path through the FGM is the probability of following that path and generating $x$. The FGM's value is then a probability function $P(x\vert\Phi)$on ${\cal X}$, i.e. its sum/integral is one. When it is necessary to make clear that we are referring to an FGM with the restrictions above, we will use the term stochastic finite growth model (SFGM).

After algorithm 1 has run, $\alpha_n$gives $P(x\vert\Phi)$. However the other $\alpha$values have meaning as well. In general $\alpha_i$is the probability of arriving at $v_i$ and observing that part of $x$corresponding to the paths between $v_i$and the source. Note that because each source-sink path corresponds to a permutation of the observation tuple, every path from the source to $v_i$must generate the same observation components -- possibly however in different orders. Dually, $\alpha_i$is the probability that random operation of the machine encounters $v_i$, while at the same time generating those observation components accounted for by the source to $v_i$paths. Algorithm 1 sums over all paths through the FGM but is easily modified to find instead a single optimal path. This is referred to as the Viterbii decode of the FGM and answers the question: what is the single most likely explanation for the observation?

Algorithm 2 also computes $P(x\vert\Phi)$as $\beta_1$. In general $\beta_i$is the probability of observing that part of $x$ corresponding to the paths between $v_i$and the sink, given passage through $v_i$. Dually, $\beta_i$is the probability that the machine passes through $v_i$and proceeds on to generate those observation components accounted for by the $v_i$to sink paths. Denoting by $L$the part of $x$corresponding to paths from the source to $v_i$, and by $R$the part of $x$corresponding to paths from $v_i$to the sink, $\beta_i = P(R\vert v_i)$and $\alpha_i =
P(L,v_i)$. But $P(R\vert v_i) = P(R\vert L,v_i)$because an SFGM is a Markov process. So $\alpha_i \cdot \beta_i = P(L,v_i,R)$, i.e. the probability of generating $x$and passing through $v_i$.

Observe that the meaning of $\beta_i$is not simply a time-reversed interpretation of $\alpha_i$. This is because an SFGM has an intrinsic directionality. In-bound DAG edge probabilities need not sum to one so simply reversing edge direction does not leave a well-formed SFGM.

For SFGMs the $\gamma_e$values computed by algorithm 3 are the probabilities $P(e\vert x) = P(e,x) / P(x)$; where `$e$' refers to the event of a transition over edge $e$. That is, $\gamma_e$gives the a posteriori probability of following edge $e$conditioned on observation (generation) of $x$. Given an SFGM and any $x$, the $\gamma_e$values must satisfy the following constraint which can in practice be used as an implementation self-check. It is a specialization of proposition 1 to SFGMs:

Proposition 3   Given a stochastic finite growth model, an observation with nonzero probability, and $\gamma$values computed by algorithms 1, 2, and 3, then:

  1. If