Eric Sven Ristad - Peter N. Yianilos
Research Report1 CS-TR-533-96
December, 1996 (revised April 1997)
Restricted to probabilistic form, FGMs correspond to stochastic automata that allow observations to be processed in many orderings and groupings - not just one-by-one in sequential order. As a result the parameters of a highly general form of stochastic transducer can be learned from examples, and the particular case of string edit distance is developed.
In the FGM framework one may direct learning by criteria beyond simple maximum-likelihood. The MAP (maximum a posteriori estimate) and MDL (minimum description length) are discussed along with the application of FGMs to causal-context probability models and unnormalized noncausal models.
Keywords:Expectation Maximization (EM), Hidden Markov Model (HMM), Baum-Welch Reestimation, Stochastic: Model, Grammar, Transducer, Transduction, Metric Learning, Maximum Likelihood (ML), Maximum a Posteriori (MAP), Minimum Description Length (MDL), String Edit Distance, String Similarity.
Hidden discrete-state stochastic approaches such as hidden Markov models (HMM) for time series analysis, stochastic context free grammars (SCFG) for natural language, and statistical mixture densities are used in many areas including speech and signal processing, pattern recognition, computational linguistics, and more recently computational biology. These are parametric models and are typically optimized using the Baum-Welch, inside-outside, and expectation maximization (EM) algorithms respectively. They share a common mathematical foundation and are shown to be instances of a single more general abstract recursive optimization paradigm which we refer to as the finite growth model framework (FGM) involving non-negative bounded functionals associated with finite directed acyclic graphs (DAG). These functionals need not be probabilities and an FGM need not correspond to a stochastic process.
With FGMs interesting new kinds of stochastic models may be designed, existing models may be improved, settings that are not strictly probabalistic in nature may be addressed, and it becomes easier to design and reason about a wide range of problems. This paper introduces and develops the FGM framework and then applies it to:
Graphical models [20,27] represent a branch of related work in which the independence structure of a set of random variables is the main focus. One can express such models using FGMs - as well as considerably more general settings such as that in which no two variables are independent (their graphical model is fully connected), but they are jointly constrained in interesting ways.
An FGM is a directed acyclic graph with a single source and sink
together with a collection of non-negative bounded weight functions
that are associated with edges - one per edge. The value of a
source-sink path is the product of the weights along it and the FGM's
value is the sum over all such paths of their values. Section
0.2 presents a formal development but we will sketch the
basic ideas here.
The weight function
associated with edge
accepts
an argument
and parameters
. The objective of FGM
optimization is to maximize the FGM's value over its parameters
which are assumed to be independent. The argument
is regarded
as fixed. The same weight function may be associated with several
edges but will in general assume different values since
is a
function of the edge to which it is attached.
If each edge has a distinct associated weight function then optimization
is immediately reduced to the task of independently maximizing the
weight functions in isolation. The interesting situation arises when
this is not the case. Here a change to a single parameter set
may affect weights throughout the DAG. The values of many source-sink
paths may be affected and an interaction between members of
arises, i.e. simple independent maximization is no longer possible.
Surprisingly it is still possible to decompose the optimization problem into certain independent subproblems. This is the key mathematical message of Baum-Welch and EM generalized to the FGM framework. However, even exact solution of the subproblems does not yield a maximum for the FGM. But a new combined parameter set does result that strictly improves it -- unless one has already converged to a limiting value. Iteration is then used to climb to such a limiting value. Unlike gradient descent with line search this approach confidently moves to a sequence of increasingly better points in parameter space and in many cases reduces to a strikingly simple and intuitive algorithm. The cost functions may themselves be FGMs and the computation of an improved parameter set simply depends upon the ability to solve the primitive maximization problems at the bottom of the recursive hierarchy.2
The mathematical idea behind this decomposition begins with a simple
equality having to do with the
of a sum of terms. Let
and define proportions
. Notice
. For any nonnegative vector
with unity sum:
Observe that the second term is
maximized when
since both are stochastic vectors.
In our setting each term
corresponds to a source-sink path, and
its value is not constant, but rather is a function of some parameter
set
. Optimizing the FGM is then the problem of maximizing
their sum over
. This is the same as optimizing their
sum; connecting the equality above to the problem of FGM optimization.
Accordingly we parameterize the equality by writing
and
where our objective is to maximize
over
.
The search for a maximum takes place iteratively. During each iteration we denote
the current parameters by
and vary
looking
for an improved set.
To express this iterative approach the equality is specialized by
defining
to be
evaluated at the current parameters
,
i.e.
. Then:
The right term may be ignored because any departure from
will decrease it whereby
and therefore
are increased.
We then focus on maximizing the left term which
corresponds to the
function of the HMM/EM literature.
Elements of the remaining summation correspond to source-sink paths
through the FGM's DAG and the proportions
are the relative
contribution of each path to the FGM's value.
Each term
is then a product of of weight functions and
breaks up into a sum of individual
terms. The proportions
are then distributed over them and the final step is to group
terms by weight function. The problem is then decomposed as required
into independent subproblems each involving a single weight function.
The argument above is made in much greater detail in section 0.2
and the formal definition of an FGM presented there is more involved
in two respects. First, there are two weight functions, not one, associated
with each edge. These are abstractions of the normally separate transition
and observation generation functions of a stochastic automaton.
Second, the
in our discussion above is defined to be some function
of an observation tuple
. A simple such function is that of coordinate
projection where, for example, the function's value might be
that of a single dimension of
.
The contribution of section 0.2 is the development of a framework that has nothing directly to do with stochastic models or probability in which the essential ideas of Baum-Welch and EM nevertheless apply. Our view is that the essential content of these ideas is one of decomposability of certain optimization problems that are captured in a somewhat general way by our definition of an FGM.
The early literature, especially theorem 2.1 of [3], reveals an awareness that the mathematics of HMMs apply beyond strictly probabalistic settings. This direction of generalization does not however appear to have been followed until now. Thinking seems to have been dominated by the motivating problem: the optimization of Markovian stochastic models. Because of the limited computational capabilities of the time and their highly mathematical viewpoint a great deal of emphasis was also placed on the primitive weight functions for which exact maximization may be performed in closed form, and those which give rise to unique global maxima. By contrast we devote very little attention to the the specific characteristics of these primitive functions and view them instead as an abstract types that by assumption support a set of required operations.
In section 0.3 the FGM framework is specialized to
the case of probabilistic models. The
above result from
projections that select one or more coordinate dimensions such that
along each source-sink path each dimension is selected exactly once.
Also, the weight functions are restricted to be probability functions
or densities. It is not important that the dimensions occur in any
fixed order as one follows alternative source-sink routes; nor is it
required that the dimensions be selected one at a time. The
contribution of this outlook is that FGM-based stochastic models can
generate observations in many orderings and groupings - not just
one-by-one in sequential order as in HMMs - a capability
exploited later in our discussion of stochastic transducers.
Section 0.4 begins by considering the case of infinite stochastic processes. Viewed generatively these models produce output of unbounded size but are used in practice to evaluate the probability of finite observations and are trained using a finite set of finite observations. Restricted in this way they correspond to FGMs that capture their truncated form. Moreover the conventional evaluation and learning algorithms associated with them arise naturally from this viewpoint.
Baum-Welch and EM are today regarded as maximum-likelihood parameter estimation techniques. A more modern learning theoretic perspective recognizes the important tension between model complexity and available training data. An important contribution of section 0.4 is the description of a construction that modifies the FGM associated with a stochastic model so that alternative criteria such as MDL or MAP can guide optimization. The optimization is still based on Baum-Welch/EM so we suggest that these should not be viewed as statistical ML techniques and that their broader utility is revealed by our more general functional optimization perspective.
The observation functions associated with each state in a conventional Hidden Markov model must satisfy an independence assumption: that their state dependency is limited to the current state (the Markov condition). Section 0.4 observes that conditioning on earlier portions of the observation sequence does not violate this assumption and demonstrates that using the FGM framework, it is easy to see that parameter reestimation for such context dependent HMMs decomposes into primitive context optimization problems. This generalizes Brown's use [7] of the previous speech window to condition generation of the current one. Thus the power of causal context models such as statistical language models may be added to the HMM framework and may lead to improved performance.
The section also presents another illustration that the reach of FGMs extends in a useful way beyond conventional and strictly stochastic models. Causal context models express the probability of an observation as a product of conditional probabilities such that each dimension is predicted once and does appear earlier as a conditioning variable. These restrictions are constrictive in settings such as image modeling where modeling each pixel based on its surrounding context seems natural. Simply relaxing the requirements that probabalistic modeling places on an FGM's source-sink projection functions proves that FGM optimization will improve even noncausal models like the image example above.
Stochastic models describe the manner in which an object is generated. This object is conventionally imagined to be something like an image, an audio recording, or a string of symbols. If instead one models pairs of such objects the focus shifts from describing the nature of an individual to uncovering the relationship between them. The pairs may be thought of as instances of the are similar concept and learning an effective model to describe them is an example of what the authors have called the metric learning paradigm.
Fixing the left element of the pair one might then ask what generative
sequences could explain the right. This is the essential idea of
-way stochastic transduction. Section 0.5
shows that a somewhat general automaton-based definition for
-way
transduction may be reduced to problems within the FGM framework. One
may view speech recognition as a transduction,
perhaps in several stages, and the application of learned transducers
to this field represents an interesting area for future work.
Perhaps the best known example of
-way transduction is that of string edit distance which is defined as the least costly way to
transform one string into another via a given set of weighted
insertion, deletion, and substitution editing operations. The
development in section 0.5.1 begins with a natural
reformulation into stochastic terms first noted by Hall and Dowling in
[12, P.390-1]. We refer to the result as stochastic edit
distance [24].
Our focus is on the interesting problem of learning insert, delete,
and substitute costs from a training corpus
of string pairs. The objective is to minimize the total
stochastic edit distance of the collection. In [24] the
authors give such a learning algorithm independent of the FGM
framework and report on experiments. Until now the costs which
parameterize edit distance algorithms have been prescribed not
learned. Section 0.5.1 of this paper describes the
reduction of this learning problem to the FGM optimization framework
and may be viewed as an adjunct to [24] providing a proof of
correctness.
The optimized costs are often quite different from the initial ones. It is in this sense that we learn them. In previous work, the costs have been stipulated after study of the problem domain. Even worse, maximally uninformative unit costs (i.e., Levenshtein distance) are sometimes stipulated when the problem domain is nontrivial. It is now possible to improve any educated guess, or start virtually from scratch from unit costs, to arrive at a set of optimized ones. Algorithms 7 and 8 of section 0.5.1 run in time proportional to the product of the two string lengths, matching the complexity of the conventional edit distance computation. Since edit distance has found widespread application our discovery of a simple way to learn costs may lead to improved performance in many problem areas.
Section 0.5.1 also observes that the notion of string edit distance may be strengthened so that the cost of an edit operation depends not only on its target but also on context. For example, such context-sensitive string edit distances allow the cost of inserting a symbol to depend on the symbol left-adjacent to the insertion site. The result remains an FGM on one therefore the ability to easily improve costs given a training corpus is retained.
Several kinds of hidden discrete-state stochastic models are in widespread use today. Hidden Markov Models (HMM) were introduced in the 1960's [2,3,1] and are typically applied in time-series settings when it is reasonable to assume that the observed data are generated or approximated well by the output of an automaton which changes state and generates output values stochastically. A matrix-based approach is typically used to formalize this notion, and [21] and [14] provide nice overviews. Normal (Gaussian) mixtures find application in statistical pattern recognition where items to be classified are represented as real-valued vectors. Stochastic context free grammars (SCFG) lead to a natural probabilistic interpretation of natural language parsing. Other model types exist, and are easily formed by combining and modifying the standard ones.
The situation is complicated by parameter tying; the practice of reducing the number of parameters in a model by equating two or more different sets of formal parameters to a single set. Observations are sometimes discrete and sometimes continuous. Entire models may be mixed together or recursively combined. Some states or transitions may be nonemitting meaning that no observation is generated. Finally, nonuniform time-series models may predict several future data points. All of these factors make it difficult to write down a single high-level framework with which to reason about and verify the correctness of a model's associated algorithms.
The contribution of FGMs is to move instead to a lower-level representation which is common to all of the models and complicating variations above. As noted earlier isn't really the original model which is represented, but rather its truncation to the given collection of finite observations. By understanding FGMs it is possible to reason about the others as mappings from their individual description language to an FGM given finite observations.
Section 0.6 relates the models above to FGMs by describing a reduction for each. As further illustration it considers the portfolio optimization problem and shows that the simple iterative algorithm reported in [9] for optimization arises by reducing the problem to an FGM.
The work described in this paper began in 1993 with our efforts to model handwriting. The focus then was on transduction, in an attempt to evaluate and learn a notion for similarity for the off-line case. This effort exposed many limitations of the conventional modeling outlook - problems that extend far beyond handwriting. To address these limitations we had to break several rules of conventional modeling only to ultimately discover that these rules are in many cases unnecessary artifacts of the history that led to HMMs and EM. After three years of generalization, and exercises in specialization, we arrived at the FGM outlook reported here.
In this section we develop the theory of Finite Growth Models (FGM) as a class of nonnegative functionals represented as annotated directed acyclic graphs. The next section focuses on the special case of probability models. Our formal definition of an FGM amounts to an intricate recursive data structure, in the computer science sense, that has been crafted to span a wide range of applications. As a result it may seem at first to be rather abstract and have little to do with stochastic modeling, but each feature is an abstraction of some characteristic required by a problem we considered. It is the mathematical engine at the center of these problems. In particular much of our development and notation parallels that of Hidden Markov Models (HMM).
Let
be a tuple observation with components
, and
denote the space of all such observations. We will not
specify the type of each component since for much of our development
it matters only that appropriate functions exists on the space of
values that a component may assume. Components may therefore
represent discrete values such as letters of the alphabet, continuous
values such as real numbers, or other more complex structures. Within
the observation tuple, component types need not be the same. In
simple time series probability-model settings, components do
have the same type, and position within the tuple corresponds to the
passage of time.
The choice value corresponding to an edge
is then
and
is denoted
for brevity where it is
understood that in this context
refers to
- but the subscript is sometimes
written for clarity.
The combined observation and choice parameters are
denoted
and
respectively, and
denotes the parameters of
.
The value of a source-sink path is the product of
the observation and choice functional values along it. The
value
of the FGM is the sum over all
source-sink paths, of the value of each.
A choice or observation model whose value is the constant
does
not affect an FGM's value and is therefore said to be
void.
Notice that since an FGM is itself a nonnegative parameterized functional, the definition above may be applied recursively to define an FGM whose constituent functionals are themselves FGMs.
Direct evaluation of an FGM's value by enumerating all source-sink
paths is of course not practical. Fortunately
may be
efficiently computed using dynamic programming. To express this
precisely requires some notation. Let
be a
topologically sorted vertex listing. Following the convention of the
HMM literature, we define corresponding real variables
. The sets of edges into and out
from a vertex
are denoted by
and
respectively. Using the notation above, the contribution of an edge
to the value of a path containing it, is
The product of these contributions is the probability of
the path. The complete dynamic program is then given by the following
simple algorithm:
This may be recognized as the forward step of the Baum-Welch HMM forward-backward algorithm [2,1,21] adapted to the
more general FGM setting. Following execution
has value
.
Algorithm 1 is said to evaluate the FGM using its current set of parameters. The optimization problem we consider seeks a parameter set that maximizes the FGM's value
but we will be content to find locally optimal solutions,
and in some cases to simply make progress. Here
is fixed and it
is the FGM's parameters that are varied.
The end-to-end concatenation of two FGMs clearly results in another
whose value is the product of the two, so that given a set of
observations
, the optimization problem
is really just an instance of Eq. 4. Viewing
as training examples the optimization then learns a
parameter set, or in the language of statistics and assuming the FGM
corresponds to a probability model, estimates its parameters.
Our approach to this problem extends the idea of Baum-Welch to FGMs. Several years following the development of HMMs, essentially the same mathematical ideas expressed in the their algorithm were rediscovered as the expectation maximization (EM) algorithm for mixture densities [11,22]. This work focused on parameter estimation for mixture densities, and has had considerable influence on a somewhat independent community. In what follows we will use the phrase Baum-Welch/EM to refer to the adaptation of these results to the FGM setting.
The HMM and EM literature deals with probabalistic models, and Baum-Welch/EM is presented as a method for reestimating their parameters. Our mathematical contribution is the recognition that the method is not essentially a statement about probability but rather should be viewed as an iterative approach to the maximization of arbitrary parameterized nonnegative functionals of a certain form. The form consists of a sum of terms each one of which is a product of subterms. An FGM's DAG is a representation of these forms which, for many with appropriate structure, has the advantages of succinct representation and computational economy. In the interesting case, the subterms are partitioned or grouped into families sharing common parameters. The approach then consists of focusing on each family in isolation thus localizing the problem. Individual family maximization problems are constructed that include certain weighting factors. These factors are constant with respect to the family's optimization but are computed using the entire DAG and are therefore influenced by all families. Thus global information is used to decompose the original problem - localizing it to individual families. If parameters are found that improve any of the individual family problems, the FGM's value will strictly increase. The method is iterative since even exact maximization of every family's problem may not maximize the FGM's value. Instead the process is repeated.
To formalize this idea of simply making progress with respect to a given maximization problem, the notation:
is introduced. Notice that the improvement need not be strict and that the result is a set of values.
Our definition of an FGM is clearly an abstraction of stochastic modeling. We remark that it is possible to start from a definition that is yet more general and in a sense simpler in which only choice models are used and a variable number may be attached to each edge. But then connecting our results to the mainly stochastic problems we are interested in becomes more complex, and it is more difficult to see the connection between our generalization and the original literature.
We therefore develop Baum-Welch/EM for FGMs starting from a lemma that is stated so as to illustrate its correspondence with the EM literature. It is a statement about improper mixture densities, i.e. where the whole and its constituent parts are not necessarily probabilities. This is relevant to FGMs because they may be regarded as immense mixtures of contributions from every source-sink path. We follow earlier lines in its proof without requiring proper densities. The lemma may also be derived starting from theorem 2.1 of [3] in its most general form.
proof: We begin by establishing the identity:

where
is viewed as a discrete random variable distributed
as
and
denotes expectation
with respect to it. The point is merely that
is independent
of
since the former depends on
not
.
Next with
we have
since
. So:

It follows from Jensen's inequality that the second
summation is minimized by
. This can also be
seen by recognizing it as
, where
is the Kullbak-Leibler distance, and
denotes entropy (see [10]). Thus
maximizing the first term, written
in the
literature, will surely increase
and hence
. But:
and the right side is recognized as the object of
the
operator in the lemma's statement.![]()
The conditional expectation operator
can cause
confusion but we use it to make clear the correspondence with the EM
literature. Our discussion in section 0.1 starting with
Eq. 1 represents an alternative approach that is free of
explicit mention of expectation and probability.
To apply lemma 1 to FGM optimization we introduce two
more simple algorithms. The first computes
by sweeping
through the DAG's vertices in reverse rather than forward order,
establishing the values of a new set of real variables
:
This corresponds to the backward step of the forward-backward algorithm. After
it has run,
equals
, and therefore
.
We denote by
the source-sink paths
through
, and by
the set of edges along
a particular path
, and by
the set of paths that include
edge
. The contribution of a single path
to the
overall value of the FGM is denoted
and given by:
So that:
Next define
, and
then for each edge
define:
Notice that by definition the
arise from a partition of the
set of all source-sink paths whence it is clear that:
The
and
variables may be used to compute these
via a third simple dynamic program corresponding to the expectation step
of EM:
where either
or
may be substituted for
. As a practical matter algorithms 3 and
2 may be combined.
We are now in position to state precisely how the objective of optimizing an FGM is decomposed into smaller problems.
where the
are computed based on
.
The exact Baum-Welch/EM reestimate is formed by replacing
the
operations with
. By
and
we mean the set valued inverses of functions
and
.
The exact form corresponds to the historical definition of the Baum-Welch/EM reestimate. However, since this paper's focus is on decomposition, and not mathematical issues relating to particular choice and observation models, we will use ``Baum-Welch reestimate'' to refer to our inexact form defined above. This allows our framework to include models for which exact solution is not convenient. Also it is interesting to note that one may choose to not reestimate one or more models, and still improve the overall FGM.
proof: We write:
which makes clear the correspondence between the FGM
and lemma 1. Now by their definitions
so from the lemma we
have:
The double summations above enumerate every edge along every path though the FGM. To complete the proof we have only to enumerate them in a different order:
Recall
so:
and the parenthesized terms are
the independent subproblems of definition 2.![]()
We now change our focus to computational and data structural issues. The first part of our discussion deals with the intricacies of recursion. That is, when an FGM invokes another FGM as one of its observation models. Here, proper algorithm and data structure design can lead to polynomial rather than exponential complexity as illustrated by the case of stochastic context free grammars described in section 0.6.2. We will confine ourselves to finite recursion. The second issue has to do with the arithmetic demands of FGM computation.
The idea behind reducing the complexity of recursive FGM operations is a simple one: avoid duplicate work. If a particular combination of an observation model and selection function occurs multiple times, the combination should only be evaluated once. Since choice models do not depend on the observation, each should be evaluated only once. The computation is then ordered so that when an edge is considered, its choice and observation functions have already been evaluated and their values may be looked up in constant time.
To formalize this idea recall that
is defined as
. Notice that edge
appears as a function argument. We say
if their definitions are the same. Each equivalence
class is referred to as a model application and expressing
algorithms in terms of these classes avoids duplicate work. If for
two edges
then by this definition
- even if
is the same function as
.
For this reason we adjust the definition to account for equivalences
that may exist among the
.
This adjustment is subtle and we therefore illustrate it by example.
Suppose that the observation model associated with an edge
of
the top level FGM is itself an FGM
and that selection function
is used by
. Now within
assume that some edge
uses an observation model
with selection function
. Then to
evaluate
, selection functions
and
are composed. Now
focus on another top level edge
to which an FGM
is
attached using selection function
. Within
consider an edge
using the same observation model
as does
, but in
combination with selection function
. To evaluate
here
and
are composed. So
should only be evaluated once if
and
are the same function. This
may be the case despite the fact that
and
are not. Thus,
whenever possible, equivalence classes should be defined so as to
account for possibly equivalent compositions of selection functions.
This is easily implemented for the class of projection selection
functions introduced in the next section.
Mathematically definition 1 states that each edge has an
attached observation model via mapping
. Computationally the edge
instead selects (in practice points to) a model application structure
which identifies the observation model and the corresponding selection
function. We denote the set of observation model applications by
. Inclusion within the recursive hierarchy induces a
partial ordering on this set where it is important to realize that a
given model application may be referenced at more than one recursive
level. We extend this ordering by specifying that all primitive
observation models are dominated by any FGM. In what follows we will
then assume that
is indexed in topological order so that
the top-level FGM is first and the primitive models form the list's
end.
We also introduce choice model application structures
. Since choice models do not depend on
or involve a
selection function, these structures are in 1:1 correspondence with
the choice models themselves and are identically indexed. Many
references may exist to a choice model, and its corresponding
application structure avoids redundant model evaluation by recording
the model's current value. It also accumulates
contributions
during expectation step processing so that a single call may be made
to maximize the model. This is explained later in greater detail.
If an FGM
is a recursive child of a parent
, one
alternative is to eliminate the recursion and expand the child inline within the parent by adding vertices and edges to its DAG.
If
is invoked along edge
between vertices
and
then
the first step of this inline expansion disconnects
from
and
replaces the invocation of
with a void model. The child
is
then inserted between the disconnected end and vertex
.
When evaluating
this inline expansion clearly isn't necessary
since one can evaluate
in isolation, record its value in
the model application structure, and in constant time refer to that
value where needed in
. The same is true for parameter
reestimation although a brief argument is needed.
proof: Let
connect vertices
and
and let
denote some
source-sink path of
and
its weight.
If
is expanded inline
the value of
will be
.
In isolation it is
.
But
from which the proposition follows.![]()
So if the same model application
is pointed to by many FGM
edges, the right computational approach is to first accumulate their
values and then to process
. Here we emphasize that
this approach is more efficient and mathematically identical to
inline expansion. So if at the bottom of the recursion the required
problems are solved, then at all higher levels they are too.
An observation model
that is not an FGM is said to be primitive,
and is represented by an abstract type for which the following
operations are defined:
Here
is an element of the range of a selection function and
is a non-negative weight propagated down as described in
proposition 2. The evaluate function returns the value
of
at
. Notice that the model's parameters
are
hidden by the abstract type. In practice a mutator may be provided
to set them and a selector provided for readout. The names Estep and Mstep follow the EM literature even though FGMs
are not restricted to maximum-likelihood estimation of probability
functions and densities.
Given a sequence of values
, and nonnegative multiplicities
the Estep and Mstep procedures
address the following optimization problem:
which generalizes Eq. 5. The Estep
procedure is called for each
and then Mstep is
called to reestimate the model's parameters. For some models the
result is the unique optimum parameter set, but in general the process
is repeated. For example, if the
are letters of the alphabet,
the
are positive frequencies, and the model's value
is a probability for each letter, then Estep need only record the
values provided and Mstep can easily determine the
unique optimal model. That is:
We point out that if the probability of symbol
is zero
prior to reestimation, then
and therefore its reestimate
are zero. Such parameters in this simple discrete model are then
effectively disabled since their corresponding choice of alphabet
symbol will always have probability zero.
A simple unique solution also exists for the normal density and others. In the discrete example above the model is constrained to produce values which sum to one over the alphabet. Normal densities are constrained to integrate to one. These constraints are obviously important so that the overall optimization problem has a bounded solution. But it is important to observe that their precise nature is irrelevant and invisible to the FGM framework. In particular there is no need for a model to represent a probability function There is also no need to restrict one's attention to models for which unique optimum solutions are readily available. As noted earlier, any progress made for even one model, will translate to progress in the overall FGM optimization.
A choice
is represented by an abstract type for which
the following operations are defined:
Here
is the index of an edge and the evaluate function
returns its value under
given the model's current parameters
. The Mstep procedure is passed the vector
and addresses the following optimization
problem:
If
is a probability function then the problem is
exactly that of Eq. 6 and an exact solution is
available. In both cases the solution is mathematically optimal but
other problem constraints might be incorporated into
or
that sometimes forbid it. For example, one might prohibit
probabilities from falling below some fixed threshold. This amounts
to a change in parameterization and we will see in a later section
that this capability may be used to alter the meaning of optimal
in useful ways. Note that we might instead have
treated a choice model as an observation model over the natural
numbers but decided that their roles are sufficiently distinct to
warrant separate designs.
We now return to our discussion of model applications and begin by
describing the information that must be associated with these
structures. An observation application structure
contains:
A choice application structure
contains:
No model variable is needed since indexing corresponds to that of the choice models themselves.
Recall that the mathematical definition 1 of an FGM
associates edges directly with elements of
via functions
- there are no model application structures. While the FGM
along with its associated model application structure lists might be
generated incrementally, it is simpler to imagine that the FGM is first
specified mathematically and then compiled into a form that
includes these lists. This compiled form is denoted
and merges
the
sets of each FGM into single densely enumerated sets. In
compiled form FGM edges do not point directly to observation models
but rather to observation model application structures. The
function is replaced by
to effect this change.
We present evaluation, expectation step, and maximization step algorithms for compiled FGMs. These are not recursive functions but rather operate by traversing the global model application lists built during compilation.
Before any of these are called the FGM must be initialized. This
procedure
consists of setting the
variables to zero in all choice model application structures and
then evaluating each choice model and recording its value there.
Evaluating an FGM and performing an expectation step
are carried out by bottom-up (reverse order) and top-down (in order)
traversals respectively of the model application lists.
Evaluation returns a single value for
given observation
and
also records the value of all subsidiary models within the observation
model application structures.
where algorithm 1 refers to the appropriate
observation and choice application structure value variables rather
than computing
and
.
As for primitive observation models the FGM expectation step procedure
accepts a
argument. Supplying a value of
for each
observed corresponds to the optimization problem of Eq. 5.
In general these
values generalize the problem by appearing
as exponents to the FGM's value as in
. The FGM
is first evaluated so that the observation model application
structures contain current values. Their
values are then set
to zero except for the first which corresponds to the top-level FGM
and is set to the function's
argument. The main loop
proceeds top-down to propagate
contributions to lower
observation model application structures. Eventually primitive
structures are reached and a primitive expectation step is performed.
The FGM's value is returned as a convenience since it was necessary
to compute it.
The FGM maximization procedure maximizes the primitive observation models and all choice models. The FGM is then reinitialized.
Observe that we do not specify that the models are maximized in any particular order since if their parameter sets are disjoint so is the act of maximizing them. Our notation thus-far has suggested that the parameter sets are in fact independent but in general they need not be. For example, two models might share the same parameters but compute two different functionals depending on them. In this case the two corresponding optimization subproblems of definition 2 are simply combined into one. From an implementation viewpoint this kind of situation can be handled entirely at the model type level through communication between them that is invisible at higher levels. So the fact that algorithm 6 specifies no order for maximization does not imply that we are limited to the case of disjoint parameter sets.
Analysis of the algorithms above is straightforward from the structure of
the compiled FGM
. Let
denote the sum over nonprimitive
observation model applications, of the number of edges in the FGM
corresponding to each. Then both evaluation and expectation step
processing require
time where
denotes the time
needed to perform the required primitive observation model type
operations. These are described by the final entries in the list of
observation model applications. When these primitive operations are
time the overall complexity is just
since their number
cannot exceed
. Each choice model is evaluated once during initialize. The maximize operation time-complexity is just
that required to perform a maximization step for each observation and
choice model in the system. Ignoring the internal requirements of the
observation and choice models, it is obvious that
requires
space.
The algorithms above are conceptually straightforward, but an
important numerical issue must be faced in their implementation. As
one sweeps through the DAG, the
and
variables can
easily become too large or small to represent as conventional floating
point values. This is true even of the final FGM value. If the FGM
is a probability model, exponent-range underflow is the issue and
corresponds to an extremely small probability of generating any
particular observation. For complex problems this behavior is the
rule not the exception since element probabilities in general decline
exponentially with dimension. Logarithmic representation is one
solution but the authors have also explored a floating point
representation with extended exponent range as reported in
[23]. Because
variables are normalized by
they may be adequately represented using standard floating
point.
We now briefly discuss the computational complexity of FGM
maximization. A restricted class of FGMs is identified for which
maximization is shown to be NP-complete. Members of this class have
constant choice functions that assume value
. There are two
selection functions
and
that assume constant values
and
respectively. Notice that there is no dependency on any
observation. Each observation model has a single parameter bit
and given boolean argument
assumes value
. The FGM then assumes a bounded integral value
and the maximization problem is well-posed. The corresponding
decision problem is: does there exist a set of parameter bits such
that the FGM's value is greater than a given integer
? This problem
is clearly in NP since given a set of parameter bits, FGM evaluation,
which runs in linear time, may be used to answer the question. We will
refer to this setting as the LFGM (logical FGM) problem.
proof: The NP-complete SAT problem is easily reduced to LFGM by first
associating an observation model with each variable. If the variable
occurs in a clause then selector
or
is used
respectively. Clauses gives rise to trivial FGMs in which each term
corresponds to a distinct edge from source to sink with the
appropriate observation model and selection function attached. The
FGMs for each clause are then concatenated to form the final FGM. The
set of clauses are then satisfied if and only if the FGM assumes a
value greater than zero. The reduction's complexity is linear whence
LFGM is NP-complete.![]()
It is possible to construct reductions like that above where the observation models are continuous not discrete functions; but further discussion of the complexity of continuous optimization is beyond the scope of this paper. These continuous formulations correspond to stochastic continuous embedding approaches to SAT and represent an interesting area for future work.
This concludes our development of FGMs as general mixture-forms and in
closing we remark that our framework is certainly open to additional
generalization and refinement. For example one might introduce choice
models that also depend on the observation, or replace the
edge association functions with distributions. Many such variations
are possible but our goal was a development somewhere between
empty generalization and over specialization.
In this section we consider specializations of Finite Growth Models such that their value represents a probability. The associated DAG and attached models may then be viewed as an acyclic automaton directing the stochastic generation of objects. Stochastic modeling is the original motivation for FGMs and much of our nomenclature is drawn from this viewpoint.
The first step is to specialize the selection functions of definition
1 to projections. That is, functions that select some
subset of the dimensions of tuple observation
. A projection is
characterized by a subset
of the dimension indices of
. We
associate some subset
with every edge
of the FGM and then
denote by
, the projection of observation tuple
corresponding to selection of the components in
. This then
is the selection function associated with
. For some edge
suppose
. Then
is a tuple with two
components which are equal in value to the second and fourth
components of
. We assume for now that
but
will later see that this is unnecessary. A composition of
projections, applied to
, is again a projection characterized by
some dimensional subset; whence it is straightforward to define
equivalence for the purpose of identifying truly distinct model
applications for recursively combined FGMs. The use of projections
corresponds to the generation of observations in arbitrary groupings
and orderings - a fact we will rely on later in our discussion of
stochastic transduction.
Next we assume that all choice and observation models are probability
functions or densities. Each
then corresponds to a stochastic
choice among edges to follow. Each
corresponds to
stochastic generation of some portion of
(since our
selectors are projections).
Finally we specialize the
subsets so that along any source-sink
path their intersection is void and their union is the complete set of
dimensions. Each path then represents one way in which
might be
stochastically generated or grown and corresponds to a permutation
of the dimensional indices
. The ``finite growth model''
framework is so named because we imagine that complete observations
are grown, starting from the empty set, in many possible orderings, as
specified by the model's underlying graph. The value of each
source-sink path through the FGM is the probability of following that
path and generating
. The FGM's value is then a probability
function
on
, i.e. its sum/integral is one.
When it is necessary to make clear that we are referring to an FGM with
the restrictions above, we will use the term stochastic finite growth model
(SFGM).
After algorithm 1 has run,
gives
. However the other
values have meaning as
well. In general
is the probability of arriving at
and observing that part of
corresponding to the paths
between
and the source. Note that because each source-sink
path corresponds to a permutation of the observation tuple, every
path from the source to
must generate the same observation
components -- possibly however in different orders. Dually,
is the probability that random operation of the machine
encounters
, while at the same time generating those observation
components accounted for by the source to
paths. Algorithm
1 sums over all paths through the FGM but is easily
modified to find instead a single optimal path. This is referred to
as the Viterbii decode of the FGM and answers the question:
what is the single most likely explanation for the observation?
Algorithm 2 also computes
as
. In
general
is the probability of observing that part of
corresponding to the paths between
and the sink, given passage
through
. Dually,
is the probability that
the machine passes through
and proceeds on to
generate those observation components accounted for by the
to
sink paths. Denoting by
the part of
corresponding to paths
from the source to
, and by
the part of
corresponding to
paths from
to the sink,
and
. But
because an SFGM is a
Markov process. So
, i.e. the
probability of generating
and passing through
.
Observe that the meaning of
is not simply a time-reversed
interpretation of
. This is because an SFGM has an intrinsic
directionality. In-bound DAG edge probabilities need not sum to one
so simply reversing edge direction does not leave a well-formed SFGM.
For SFGMs the
values computed by algorithm 3 are
the probabilities
; where `
'
refers to the event of a transition over edge
. That is,
gives the a posteriori probability of following edge
conditioned on observation (generation) of
.
Given an SFGM and any
, the
values must satisfy the following
constraint which can in practice be used as an implementation self-check.
It is a specialization of proposition 1 to SFGMs: