David Loewenstern*
NEC Research Institute, 4 Independence Way, Princeton, NJ
08540
Phone: 6099512798
Fax: 6099512483
Email: loewenst@paul.rutgers.edu
and
Peter N. Yianilos
NEC Research Institute, 4 Independence Way, Princeton, NJ
08540
and
Department of Computer Science
Princeton University, Princeton, New Jersey 08544
Email: pny@research.nj.nec.com
Submitted: November 10, 1996
If DNA were a random string over its alphabet , an optimal code would assign bits to each nucleotide. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than fivefold (1.66) and may lead to better performance in microbiological pattern recognition applications.
One of our main contributions, and the principle source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using Expectation Maximization (EM).
Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches, are implemented and experimentally compared.
We also report lower entropy estimates for coding regions extracted from a large collection of nonredundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.841.87 bits when the prediction problem is divided into two stages: i) predict the next amino acid based on inexact polypeptide matches, and ii) predict the particular codon. Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of nonredundant coding sequences.
Keywords:DNA, Information Theoretic Entropy, Expectation Maximization (EM), Data Compression, Context Modeling, Time Series Prediction.
The DNA molecule encodes information which by convention is represented as a symbolic string over the alphabet . Assuming each character (nucleotide) is drawn uniformly at random from the alphabet, and that all positions in the string are independent, we know from elementary information theory [Cover and Thomas, 1991,Bell et al., 1990] that an optimal code will devote 2 bits to representing each character. This is the maximum entropy case. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates, and confirm our intuition that it is far from random. Unfortunately this has not been the case for many natural DNA sequences, including portions of the human genome.
One example is the human retinoblastoma susceptibility gene containing nucleotides, and referred to as HUMRETBLAS. The alphabet members do not occur with equal frequency in this string, and accounting for this yields an entropy estimate of roughly bits. This is perhaps not surprising since such single character models are the weakest in our information theory arsenal. A logical next step taken by several investigators, focuses instead on higher order entropy estimates arising from measurements of the frequencies of longer sequences. For natural languages (e.g. English) this step typically leads to significantly lower entropy estimates. But the best resulting estimate for HUMRETBLAS is roughly bits [Mantegna et al., 1993], still not impressively different from our bit random starting point. This may be something of a surprise, since such models reflect such known DNA structure as composition and CG suppression. But these known effects have little impact on the entropy of DNA sequences as a whole.
One may view DNA as a time series, and these higher order models as predictors of the next nucleotide based on its immediate past. With this interpretation the simple bit model relies on none of the past. The bit result is then even more surprising since it implies that knowledge of the immediate past reduces the entropy estimate by a mere bits.^{1} Data compression techniques such as LempelZiv (LZ) coding [Ziv and Lempel, 1977] may be viewed as entropy estimators, with LZ corresponding to a model that predicts based on a historical context of variable length. It ``compresses'' HUMRETBLAS to bits per character^{2}, which is actually worse than the flat random model we started with. The authors' earlier efforts implemented other advanced modeling techniques but failed to obtain estimates below approximately bits.
In this paper we describe a new statistical model, the strongest we are aware of, for naturally occurring DNA sequences. The result is lower entropy estimates for many sequences. Our experiments include a wide variety of DNA, and in almost all cases our model outperforms the best earlier reported results. For HUMRETBLAS, the result is approximately bits per character  bits improved from our bit starting level. The standard higher order estimate of bits is only bits lower, so in this sense our result represents a fivefold improvement. In every case our model outperforms the most common techniques. These results are significant because of our model's consistent and frequently substantial advantage over earlier methods.
So far we have described the quest for lower entropy estimates as a rather pure game, i.e. to predict better an unknown nucleotide based on the rest of the sequence. There are both conceptual and practical reasons to believe that this is an interesting pursuit and that our experimental results are of some importance. Conceptually, the pursuit of better models for data within some domain corresponds rather directly to the discovery of structure in that domain. That is, we model so that we can better understand. Practically, statistical modeling of DNA has proven to be somewhat effective in several problem areas, including the automatic identification of coding regions [Lauc et al., 1992,Cosmi et al., 1990,Cardon and Stormo, 1992,Farach et al., 1995,Krogh et al., 1994], and in classifying unknown sequences into one of a discrete number of classes. The success of such methods ultimately rests on the strength of the models they employ.
In section 2 we discuss the interpretation of entropy estimates obtained by methods such as ours, and by other approaches such as data compressors. The definition of entropy in the context of DNA nucleotide prediction is made more precise, and we discuss entropy estimates for coding regions.
A detailed exposition of our model is presented in section 3, but we first introduce and motivate it here in general terms. Conventional fixedcontext models focus on a trailing context, i.e. the immediately preceding nucleotides. Longer contexts reach farther into the past, and might reasonably be expected to result in stronger models. However as context length increases, it becomes increasingly unlikely that a given context has ever been seen. This problem has led to the development of variable length context language models [Bell et al., 1990], which use long contexts when enough earlier observations exists, and otherwise use shorter ones. Since as earlier noted, the shortterm behavior of many natural DNA sequences is so nearly random, these more advanced variable length schemes do not push farther into the past, except in the unusual event of a long exact repeat.
The basic idea behind our model is to push farther into the past by relaxing the exactmatch requirement for contexts. Our formalization of this idea in section 3 is one of the main contributions of this paper.
The metric for determining quality of match is primarily a simple Hamming distance, but in the interests of exploring the use of additional biological knowledge, three other metrics were evaluated in section 6. We also describe a twostage model for coding regions in which an amino acid is first predicted based on earlier amino acids, followed by the prediction of a particular codon. Section 6 concludes with a discussion that suggests why the high entropy levels observed for coding regions may not be so surprising after all.
The term entropy estimate can be somewhat vague. This section continues our introduction of the term by clarifying its various senses as they relate to DNA, and its relationship to data compression and prediction. We include the section in part to answer some of the most frequently asked basic questions about this work. Our discussion will first focus on the sense of entropy that corresponds to a stochastic process. Later, we describe another equivalent sense more suited to sequences of finite extent. The entropy rate of a stochastic process is defined as:
where this limit need not in general exist, and denotes the information theoretic entropy function (see [Cover and Thomas, 1991].) Given full knowledge of the process, and the limit's existence, the entropy rate is a welldefined attribute of the process. But we know very little of the process underlying the generation of natural DNA, and can merely observe the outcome, i.e. the nucleotide sequence. Given a predictive model , and a long DNA sequence , we are therefore content to consider:
This may be thought of as the average bits per symbol required to code increasingly long observation sequences. If our model matches Nature perfectly, Nature's process is in some sense wellbehaved^{3}, and our sequence is of infinite length, then this limit matches the entropy rate. Of course: i) our model does not match Nature, ii) there is no reason to believe Nature is in the required sense wellbehaved, and iii) we can observe only finite DNA sequences. It is in this sense that one is computing an entropy estimate when an imperfect model is applied to a finite empirically observed sequence.
Given a data compression program, Equation 2 may therefore be approximated by dividing the number of bits output by the number of symbols (nucleotides) input. We will refer to computations like this as purely compressive entropy estimates. Most existing compression algorithms, when applied to natural data, do in fact frequently approach a limit in bits/character  their estimate of the source's entropy. Unless the sequence being compressed is very long, the purely compressive estimate then overstates the actual entropy. One might then focus only on the codelengths near the end of the sequence. The risk however is then that the end might just happen to be more predictable and not reflect the overall entropy. For example, the end might, in the case of DNA, contain many long exact repeats of earlier segments. This suggests our second approach, that of crossvalidation entropy estimates. Here the sequence is first partitioned into equal size segments. The entropy of each segment is then computed by coding it having first trained the model using the remaining segments. The reported entropy estimate is the average of these estimates. One can imagine that each of these estimates moves one segment to the end of the sequence, codes it, and averages codelengths for the final segment only. As increases, the individual crossvalidation estimates may be plotted to visualize the instantaneous entropy rate as a function of position in the sequence. Several of these plots are presented in section 5.
In what follows, we will switch without harm between a compressive viewpoint, that of stochastic prediction in which the objective is to predict the next nucleotide, and generative stochastic modeling in which one imagines the data to emanate from a particular process. This is possible because a generative process gives rise to a prediction, and a prediction can be converted to a code via arithmetic coding. Starting from a data compressor, we can think of the bits output for a single input symbol, as of the probability that a corresponding predictor will predict, or generator will generate that symbol^{4}.
The models discussed so far in this paper introduce only one kind of domain knowledge: that natural DNA sequences include more near repetitions than one expects at random. Part of the reason that so little effort has been devoted to more advanced models that incorporate complex domain knowledge, is that models of the ignorant variety do a very good job of modeling other natural sequences. For example, Shannon's work estimated the entropy of English text to be roughly bits per character (see discussion in [Bell et al., 1990]). Random text over a 27 letter alphabet (AZ and space) corresponds to bits. Today's statistical models, with no specific knowledge of English syntax or semantics, can achieve roughly bits.
Returning to the question that opened our discussion, we submit that the essential cause of our surprise is that simple statistical models, which perform so well in other cases, fail to discover significant structure in DNA. Our work demonstrates that DNA is more predictable than had previously been established, but falls far short of the compression levels easily achieved for natural language. This suggests that local structure parallel to ``words'' and ``phrases'' is largely missing in DNA, even when the exact match requirement is relaxed.
As to ``what entropy level we expect and why'', we believe that the question ultimately comes down to one of quantifying the magnitude of the functional degeneracy of polypeptides. That is, how many different sequences are essentially equivalent functionally? If this degeneracy turns out to be large, then DNA is to its function, as an acoustic waveform is to human speech. That is, small local distortions do little to alter the overall message. This is discussed further in Section 6.
In this section we further motivate our model and then describe it in formal terms.
That natural DNA includes near repeats is well known, and in [Herzel, 1988] the statistics of their occurrence are discussed. We measure nearness using ordinary Hamming distance, i.e. the number of positions at which two equal length strings disagree. Given a target string of length , it is then natural to wonder how many distant matches exist in a string of length . As remarked earlier, DNA seems very nearly random over short distances. Assuming uniform randomness it is easily shown that we expect to see distant matches. Natural DNA sequences depart markedly from this behavior, providing one part of the motivation for our model. For example (table 1), using HUMRETBLAS, which consists of nucleotides, and a randomly chosen substring of length from the first of the sequence, one expects Hamming distance matches in the last of the sequence under the random model. We actually see where again, this is an average over all length targets in the last of HUMRETBLAS. When such matches exist, examination of the following nucleotide yields a prediction that is correct of the time. That these matches, when present, lead to good predictions is a second motivating factor. Unfortunately such Hamming distance matches occur for only of the positions in HUMRETBLAS, so alone, this effect is unlikely to lead to much better predictions on average. But if one again refers to table 1, roughly of the positions have an earlier match at distance  and at even this distance, predictions based on past matches are correct of the time. The potential effectiveness of predictions based on matches at all distances is our final motivating factor. The objective, then, is to combine these weak information sources to form a prediction. Also, a version of table 1 may be made for multiple windows. Our model combines all of this information and makes a single prediction. The use of near matches for prediction is certainly consistent with an evolutionary view in which Nature builds a genome in part by borrowing mutated forms of her earlier work.
One may view each row of table 1 (really each Hamming distance) as corresponding to a predictive expert, . The prediction of the Hamming distance expert, is formed by examining all past matches exactly this distance from our trailing context window, and capturing the distribution of following nucleotides by maintaining a simple table of counters. The simplest way to combine these experts is by a fixed set of weights that sum to one.
But suppose that while trying to predict a particular nucleotide using , our past experience includes no Hamming distance matches, i.e. the closest past window is distance . This information is known before we see the tobepredicted nucleotide and may therefore influence the prediction. In particular it makes no sense to give any weight to the opinions of the first three experts  in fact their opinion is not even welldefined in this case. Only the experts corresponding to Hamming distances are relevant. We're then interested in properly weighting them conditioned on the assumption that the closest match we've seen lies at distance . In what follows we will refer to this value as first Hamming, . Finally, since we don't know a priori how window size will influence the prediction, our model is formed at the uppermost level by a mixture of models, each considering a fixed window size from some fixed prior set.
We denote by the discrete random variable representing the nucleotide (basepair) to be predicted. Its four values are and correspond to . By we denote the positive integer random variable corresponding to the length of our trailing context window. The set of values it may assume is parameter of the model, defined by specifying nonzero prior probabilities for only a finite subset of the positive integers. For example, allowing considers context windows of these four lengths. Next, denotes the first Hamming distance to be considered. It may assume values . By we denote the index of each Hamming distance expert. So lies in the range . Finally, we use to represent our modeling past, i.e. the DNA that we have either already predicted, or have set aside as reference information.
Given a fixed window size , and distance , there is a natural prediction formed by locating all distance matches in the past to the trailing context of length , and then using the distribution of nucleotides that follow them. This is a single expert as described above. This prediction is independent of so . Our prediction arises from the joint probability by summing over the hidden variables as follows:
(3) 
The final term is then expressed as a product of conditionals:
(4) 
In this expression and for equal to the distance of the closest match to our trailing context window of length , in the . At all other values . That is, is a boolean function . Next, we assume is independent of the past in our model, so and consists of a vector of fixed mixing coefficients that select a window size. Introducing a dependency here may improve the model and represents an interesting area for future work. Finally, in our model is also independent of the past and consists of a fixed vector of mixing coefficients that select a Hamming distance given the earlier choice of a window size , and observation that the nearest past match is at distance . We then have:
(5) 
The first term in the summation is recognized as a single expert, the second selects an expert based on and , the third deterministically selects a single value, which receives probability , and the final term selects a window size. The first and third terms are determined entirely from the current context and the past; the second and fourth terms are selection parameters that must be tuned during a learning process.
The model may also be viewed generatively (reading from left to right) as depicted in figure 1. In this example a window size is first chosen from the set . The probability of each choice is shown beside it. Notice that they, and all choices at each node in the diagram, sum to unity. In the figure, is chosen and the next choice is deterministic. Here we've assumed that the closest match to our trailing context of length is distance away. Now given and we choose with the probabilities shown. Here we've assumed that is chosen. Finally, given all the earlier choices, we generate a nucleotide according to probabilities shown.
We refer again to figure 1 to clarify the model's parameter structure. The probabilities associated with the choice of are parameters, as are those associated with the choice of for each and . But recall that the probabilities on each arc in the graph are computed and are not therefore parameters. Similarly the probabilities associated with are not parameters. They too are computed and represent the opinion of a single expert as described earlier. Thus for the figure shown there are parameters for , and associated with , for a total of . Other parameters include our selection of window sizes to consider, and several explained below having to do with the learning algorithm.
The learning task before us is to estimate these parameters by examining the training set. Our algorithm is an application of the BaumWelch algorithm for Hidden Markov Models [Baum and Eagon, 1967,Baum et al., 1970,Baum, 1972,Poritz, 1988], and may also be viewed as an instance of Expectation Maximization (EM), a later rediscovery [Dempster et al., 1977,Redner and Walker, 1984] of essentially the same algorithm and underlying information theoretic inequality.
The BaumWelch algorithm is an optimization technique which locally maximizes the probability of generating the training set from the model by adjusting parameters. A global optimization method would be expected to yield better results in general. In the general case, however, the problem of globally optimizing directed acyclic graphbased models is known to be NPcomplete (see [Yianilos, 1997, p. 32]). Our model fits within this class but we have not considered the complexity of its optimization. It is known, however, to exhibit multiple local optima.^{5}
In the case of our model, the parameters are the probability associated with , , and the probability associated with , . Before learning begins, these parameters are initialized in our implementation to the flat (uniform) distribution. Examples from the training set are now presented repeatedly, and these parameters are reestimated upon each presentation  climbing away from their starting values towards some set of values that locally maximize the log likelihood of the training set.
The training set consists of a series of prediction problems: predict a given nucleotide using its trailing context and some past reference DNA. For the experiments of this paper, we select positions to be predicted at random from the training DNA. The rest of the sequence then serves as reference when searching for nearmatches. Because of the simple treelike branching structure of our model, the reestimation formulae are particularly simple and natural. Accumulated expectations are first computed by:
(6) 
where the probabilities are assumed to be conditioned on the model's current parameter set, and the summation is over our randomly chosen positions to be predicted. The reestimated window selection probabilities are then:
(7) 
Expectations for are accumulated according to:
(8) 
The reestimated conditional Hamming distance selection probabilities are then:
(9) 
This new parameter set must increase the likelihood of the training set under the model. Performing the steps above once corresponds to one EM iteration. Each iteration will continue to improve the model unless a local maximum is reached. In section 5 we will study the effect of these iterations on the performance of our model, and also examine the model's sensitivity to other parameters.
At the conclusion of each reestimation step above, the reestimated vector is normalized to sum to unity. This operation is also performed when an expert converts counts derived from earlier matches into a stochastic prediction for the next nucleotide. In both places it is possible for one or even all vector locations to be zero. The result is then one or more zero probabilities after normalization. If the entire vector is zero, the normalization operation is of course no longer even welldefined. To ensure that our system deals only with finite codelengths, and to handle the allzero case simply, we, as a practical expedient, add a small constant, , to each vector position before normalization. We stress that this is not done for the same reason that one might use Laplace's rule (add one to event counts) to flatten the prediction of a simple frequencybased model. Such techniques are basically a remedy for data sparseness. We require no such flattening since our model is itself a mixture of many experts, including some for which plenty of data is available. Experiments confirm our intuition that additional Laplacestyle flattening is in fact harmful.
We conclude by sketching the operation of a fully online model. Rather than performing EM iterations on a training set, expectations are accumulated as each new nucleotide is predicted. The model is then updated via the normalization (maximization) steps above before the next character is processed. To deal with data sparsity early on, normalization operations employ more aggressive flattening than above, but the constant employed decays over time to some very small value such as that above. The mathematical behavior of this algorithm is not as well understood as that of conventional offline training, and we will therefore study it independently in our future work.
The DNA sequences were chosen to pass the following criteria: sufficient length to support this type of entropy estimation method, inclusion of a wide variety of species and sequence types to evaluate the generality of the method, and inclusion of sequences used to benchmark other published methods. All sequences are from GenBank Release No. 92.0 of 18 December 1995. They are: thirteen mammalian genes (GenBank loci HSMHCAPG, HUMGHCSA, HUMHBB, HUMHDABCD, HUMHPRTB, HUMMMDBC, HUMNEUROF, HUMRETBLAS, HUMTCRADC, HUMVITDBP, MMBGCXD, MUSTCRA, RATCRYG), a long mammalian intron (HUMDYSTROP), a C. elegans cosmid (CEC07A9), seven prokaryote genes (BSGENR, ECO110K, ECOHU47, ECOUW82, ECOUW85, ECOUW87, ECOUW89), a yeast chromosome (SCCHRIII), a plant chloroplast genome (CHNTXX), two mitochondrial genomes (MPOMTCG, PANMTPACGA), five eukaryotic viral genomes (ASFV55KB, EBV, HE1CG, HEHCMVCG, VACCG), and three bacteriophage genomes (LAMCG, MLCGA, T7CG).
There has been considerable discussion in the literature of the statistical differences between coding and noncoding regions. To support our study of this issue we assembled several long strands consisting of purely coding or noncoding DNA. These sequences were formed by concatenating coding or noncoding regions from one or more of the above DNA sequences. We consider a region to be coding if either the sense or antisense strand is translated; otherwise it is noncoding.
To gather a larger body of coding regions, we obtained a data set of 490 complete human genes. This data set was screened to remove any partial genes, pseudogenes, mutants, copies, or variants of the same gene [Noordewier, 1996]. The resulting sequence contains bases and is referred to as our nonredundant data set.
Our model's performance on the sequences described in section 4 is summarized in table 2. In some cases our results may be compared directly with estimates from [Grumbach and Tahi, 1994], which are included in the table. Our values for (the 4symbol entropy) may be compared with the redundancy estimates of [Mantegna et al., 1993] and are in agreement. We have grouped our results by general type (i.e. mammalian, prokaryote, etc.).
The columns contain conventional multigram entropy estimates. The CDNA column reports our model's crossvalidation entropy estimates. Compressive estimates from the BIOCOMPRESS2 program of [Grumbach and Tahi, 1994] are contained in the following column. Our model's compressive estimates are given in the table's final column, CDNAcompress. The compressive estimates are generated by partitioning the sequence into 20 equal segments: . The entropy estimate for , is bits/nucleotide. is calculated using CDNA training on and testing on . is calculated from CDNA trained on and , tested on , and so forth. The overall estimate for CDNAcompress is .
The reported CDNA and CDNAcompress results depend on several model parameters. Their values and a sensitivity analysis are given later in this section. In the mammalian group, ``14 Sequences'', and the similarly named lines elsewhere in the table, represent the average of the entropy estimates for all relevant individual sequences, weighted by sequence length.
The conventional multigram entropy results are in general not too informative. It is interesting however to note the variation among sequences in even , the distribution of individual nucleotides. Only two of these estimates are below bits: bits for both HUMGHCSA and PANMTPACGA. In the first case CDNA reports bits. This is a very repetitive sequence and CDNA is able to exploit this to produce the lowest entropy estimate we observed for any sequence. In this case of PANMTPACGA, nearly all of the redundancy is explained by the unigraph statistics . The observed relationship between and gene density [Dujon et al., 1994] in yeast is reflected in the discrepancy in between the coding and noncoding region of Yeast chromosome III.
observe that the estimate is rarely better than , and in some cases is markedly worse (a consequence of limited sequence length).
As a point of comparison, entropy rate was also estimated simply by compressing the ASCII representation of the sequence using the UNIX compress utility and dividing the length of the compressed sequence by the length of the uncompressed sequence. Because of its limited dictionary size, the utility cannot be expected to converge to a good entropy estimate for long, complex sequences. For shorter sequences (roughly, sequences not greatly longer than 32,000 nucleotides), this limitation should have no impact. Notice that in all cases, the UNIX compress utility performs worse than no model at all (uniformly random prediction).
To make comparisons among entropy estimation methods clearer, figure 2 summarizes the results from table 2 for CDNA, , BIOCOMPRESS2, and CDNAcompress. In all cases CDNA outperforms conventional entropy estimates, and in almost all cases by a substantial margin. In many cases the compressive estimates from CDNA (i.e., CDNAcompress) are lower than those of BIOCOMPRESS2. In several they are comparable, and in only one case (VACCG) does BIOCOMPRESS2 outperform CDNAcompress by as much as 0.05 bits. It is possible that even this case is due to the offline nature of the current CDNA program which as mentioned above forces us to compute compressive estimates by dividing the sequence into large segments. Finally, we remark that while the table contains results for HUMRETBLAS, this sequence was used in our parameter sensitivity study below.
The CDNA program's behavior is parameterized by several values:
The distance measure is discussed later in section 6 and for all other experiments is fixed as described earlier. We used 3way crossvalidation to assess parameter sensitivity (see discussion below), and 8way crossvalidation for all experiments. Our test samples were of size per crossvalidation segment for a total of positions. To assess the variation due to all sampling, we performed different crossvalidation estimates using different random seeds. The mean entropy estimate for HUMRETBLAS was bits with a standard deviation of bits. Larger samples would result in smaller variance at the expense of computer time.
Figure 3 shows the effect of EM iterations on the entropy estimate for both training and test sets. This figure and the two that follow employ HUMRETBLAS. Based on this analysis, we chose EM iterations for our experiments. Notice that there is little overtraining. Both train and test performance are essentially constant after iterations, and their gap is small.
Figure 4 shows the effect of a single window size on performance. Beyond , overtraining is evident. This is to be expected since the number of parameters grows quadratically with window size. For our experiments we selected a mixture of windows of size . Odd values were excluded only to reduce memory requirements for the model. Additional experiments indicated that while mixtures of different window sizes did in general help, the benefit was somewhat small. We did not study this for all sequences and therefore employed a mixture to ensure that the model is as general as possible (in theory the mixture can only help).
Finally, figure 5 shows the sensitivity of performance on training set size. We used a training set of size for our experiments because beyond that point, little improvement was evident: train and test performance had essentially converged.
Our results so far have consisted of entropy estimates for sequences of a single type. We now present the results of several crossentropy experiments. By this we mean that the type of the training set is different from that of the test set. In table 3 mammalian coding regions and noncoding regions form the two types. A substantial crossentropic gap is apparent. That is, a model trained on the coding sequence might be used to distinguish coding from noncoding based on entropy. The same is true when one trains on the noncoding sequence.
The same experiment produces very different results for yeast as shown in table 4. While a large gap exists when one trains using the noncoding regions, a reverse gap exists in the other case. That is, training on the coding regions assigns shorter codes to the noncoding regions, than to itself. While this effect is weak, it suggests that sequences from the coding region may be echoed in the noncoding region.
Crossspecies entropy estimates are interesting in general, but our one example shown in table 5 shows little interesting structure. That is, yeast coding regions teach nothing about human coding regions and visa versa. In fact, the entropy estimates are in both cases worse than random. In the noncoding case, a weak relationship exists, but is easily explained by agreement on the low order multigraphic statistics of the sequences, e.g. .
The entropy estimates we have discussed reduce to a single number the model's performance on a sequence. It represents the average codelength but says nothing about their distribution. The histograms of figures 6,7, and 8, depict this distribution for the EBV virus, the human gene HUMRETBLAS, and the yeast chromosome SCCHRIII. They present an interesting visual signature of a sequence. Trimodality is apparent in all three, but is present to a varying extent.
The first mode, most prevalent in our viral example, corresponds to nearzero codelengths that correspond to long exact or nearexact repeats. In the viral case, this peak is particularly narrow because while the sequence is known to contain several long exact repeats, it lacks the full range of near repeats of various lengths found in sequences such as HUMRETBLAS. The middle mode corresponds to nucleotides having a long context that matches well with some other segments from the sequence. In this case short codes are assigned so long as these matches predict well the following nucleotide. The rightmost mode, particularly pronounced in SCCHRIII and less so in HUMRETBLAS, correspond to codes well over bits. Here the model had a strong prediction, but was wrong.
The distribution of codelengths displayed in these histograms says nothing about how codelengths are distributed with respect to position in the DNA sequence. To visualize this we have plotted in figures 9 and 10 the codelengths at random positions for HUMRETBLAS and SCCHRIII. It is apparent that the distribution is not homogeneous and presents another interesting visual signature of the sequence. The distribution for HUMRETBLAS is shown at a a closer scale (figure 11) where finer structure is revealed. The modality noted in the histograms is also apparent in these plots.
Our model exploits near matches to form a prediction, and its performance therefore depends on precisely what is meant by near, i.e., the distance metric used to judge string similarity. In this section we begin by discussing four metrics and report on their performance.
The simplest notion of string distance relevant to our domain is simple Hamming distance: the number of mismatching nucleotides in two strings of equal length. We will refer to this as the nucleotidesense metric. As mentioned earlier, our domain knowledge suggests that we also compare the strings after reversing one, and complementing its bases. Both distances are computed and the smaller one used. This is our nucleotideboth metric, and was used in all experiments from earlier sections.
Coding regions have additional structure which we capture in another pair of metrics: aminosense and aminoboth. A common misconception is that coding regions should compress well because three nucleotides together ( possibilities) code for only one of amino acids. This reasoning is flawed because of the possibilities represent valid codes for amino acids. Only codes are used to stop transcription. Thus the maximum entropy level for coding regions, is not bits, nor bits as for random unconstrained sequences, but rather is bits. In fact, it has been observed by several authors that coding regions are less compressible than noncoding regions (e.g., [Mantegna et al., 1993,Salamon and Konopka, 1992,Farach et al., 1995]). So it is clear that two sequences that code for the same polypeptide may nevertheless have large Hamming distance. The aminosense distance between two strings whose length is a multiple of 3, and that are assumed to be aligned with the reading frame, is straightforward: it is simply the number of mismatches in the amino acid sequence coded by the nucleotide sequence; where the stop signal is considered to be a 21st amino acid.
We extend this definition to the case in which the strings being compared are not necessarily a multiple of 3 in length, and we drop the assumption that they are aligned with the reading frame. The comparison does however assume that the strings are similarly aligned. If the first string begins one base into the reading frame, it is assumed that the other string does as well. That is, the strings must be in phase. This assumption is appropriate because our search for near matches will consider all possible positions, and in so doing all relative phases.
Selected experiments comparing these four metrics are shown in table 6^{6}. Entire sequences as well as their noncoding and coding regions were considered. The nucleotideboth metric performs best, and was therefore used for the experiments reported in earlier sections.
The amino metrics never outperformed their nucleotide counterparts. This is not so surprising for entire sequences, or noncoding regions, but we expected the amino metrics might perform better for coding regions. In fact their estimates are somewhat worse. Their poor performance is explained by the fact that they distinguish fewer discrete distances, and are always blind to the exact nucleotide structure of the sequence, even in the positions immediately before the base to be predicted.
The motivation in designing these metrics was the possibility that coding regions might contain long nearly identical amino acid sequences, which despite their nearness when expressed as an amino acid sequence, are far apart when coded as nucleotides. As explained above, our first attempt to capture this domain knowledge in the model was unsuccessful, so we tried another approach that produced a slightly stronger model for coding regions.
Our second approach separates the prediction problem into two stages. An amino acid is first predicted using CDNA based on inexact matches, where the sequence is viewed at the amino acid level, i.e., has alphabet size . Unlike our earlier models, which make no assumption as to the nature of the DNA, this model assumes that its input is a sequence of legal codons. The model's second stage predicts a particular codon, conditioned on the already predicted amino acid, and on earlier positions which are now viewed at the nucleotide level. The codon entropy estimate is then the sum of the entropies of these two stages, and we write:
(10) 
Dividing by three then gives a pernucleotide entropy which may be compared with earlier results. Table 7 gives the results of several models on our nonredundant data set.
The first line of table 7 provides for reference, the entropy corresponding to a uniformly random sequence of codons. The next line reports the standard multigraphic entropy where alphabet is nucleotides ( was optimal). Line three reports the result of the CDNA program operating on a sequence of nucleotides^{7}. Its performance is a mere bits better than . The fourth line reports the results from a twostage model where the first stage predicts an amino acid using a simple 2gram model (i.e., frequencies of amino acid pairs). The second stage predicts a codon, based on the predicted amino acid, and the two preceding nucleotides. In line 5, the 2gram model is replaced with CDNA program where the alphabet is now amino acids, and windows from to amino acids are used. As for earlier experiments, the result on line five arises from 8way cross validation. Therefore, nearby matches will not frequently contribute to a prediction. Line six employs what amounts to maximal cross validation, where a sequence element is predicted using everything else in the sequence. The same technique was used to produce figures 911.
It is apparent that our nonredundant collection of coding material is harder to model than the entire genes considered in earlier sections. Nevertheless we are able to improve the multigraphic estimate somewhat to . It is important to note our nonredundant data set is not a random sample, but rather is closer to the worst case. We expect that lower entropy estimates would result from random samples of similar size, and remark that this was the case when we considered Mammals/Codingonly in table 2. There we reported an estimate of bits using CDNA at the nucleotide level.
It is possible that the addition of more domain knowledge will produce lower entropy estimates for coding regions. However, we believe that the actual entropy may be not so far below our current estimates. This is consistent with the view that the coding regions of DNA commonly amount to a highly random string, with only isolated critical regions. If much of large proteins amounts to scaffolding, and the bulk of proteins in a genome are large, then we would expect large entropies. By scaffolding we mean weakly constrained structure necessary only to ensure that the protein folds correctly, and that particular active regions are positioned properly in the resulting molecule. There are certainly many genes that code for somewhat small proteins, but since entropy is a perbase expected value, the genes that code for large proteins have proportionaly greater affect on our estimate. Figure 12 shows that in our nonredundant collection of human genes, fewer than one third of bases are contained in coding regions of length or less.
We conclude with a crude analysis to quantify our belief. If only the hydrophobic/hydrophilic nature of an amino acid were critical, then at most one bit would be required to make an acceptable selection. This is then bit per base. If, except for this structure, the string is random, then the lowest entropy we would expect is: bits per base. Biological experiments to quantify the sensitivity of proteins of various length to mutation, would then shed light on the actual entropy of coding regions.
We have shown that the near repeats in natural DNA sequences may be incorporated into a statistical model resulting in significantly lower entropy estimates. For some sequences our model is the first to detect substantial deviations from random behavior and illustrates the importance of inexact string matching techniques in this field. It is entirely possible that very different results will be obtained, particularly for coding regions, when much more DNA is available for analysis.
It should be noted that entropy estimates include the known effect of [Gatlin, 1972] on entropy estimation, and BIOCOMPRESS2 includes the also known effect of long exact repeats and exact complement repeats [Herzel et al., 1994]. CDNA's generally superior performance indicates that DNA possesses more structure which may be exploited.
Several notions of distance were evaluated and the best performance resulted from considering both Hamming distance to reversed and complemented targets, as well as standard Hamming distance, then selecting the minimum. We also described twostage models crafted especially for coding regions, which perform better than our basic model on coding sequences.
Because of the observations in Section 2, a data compressor may be used as a classifier, and the result interpreted in probabilistic terms. We mention this because classification is one of the applications motivating our work. To illustrate, suppose one has two long sequences of DNA, the first of some type and the other of type . The task is to classify a third strand . Let denote the bits output by the compressor when fed sequence , and denote string concatenation. Then may by our remarks above be interpreted as . Similarly corresponds to . At this point one might simply compare to affect classification, or exponentiate yielding probabilities. These can then be combined along with prior class probabilities, resulting in Bayesian classifier built from data compressors.
Future work will consider stronger models exploiting additional effects. The current assumption that each prediction event is independent of the last may be dropped and a temporal dependence component added to the model. Next the metric may be enhanced to notice not only the number of mismatches, but also where they occur. Other structural changes are possible and the inclusion of additional domain knowledge in some form is a particularly interesting direction. Also, there are entirely different approaches one might take to the problem of combining experts.
Because of our interest in alternative metrics, simple linear search is used to find near matches. Our results indicate that standard Hamming distance performs well, so we plan to implement more advanced sublinear algorithms for this purpose.
Finally, we plan to investigate the application of these new, stronger models to various applications in the microbiological field.
http://paul.rutgers.edu/~loewenst/cdna.html
.






This document was generated using the LaTeX2HTML translator Version 2K.1beta (1.47)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html split 0 image_type gif nonext_page_in_navigation noprevious_page_in_navigation up_url ../main.html up_title 'Publication homepage' numbered_footnotes cdna.tex
The translation was initiated by Peter N. Yianilos on 20020630