David Loewenstern*
NEC Research Institute, 4 Independence Way, Princeton, NJ
08540
Phone: 609-951-2798
Fax: 609-951-2483
Email: loewenst@paul.rutgers.edu
and
Peter N. Yianilos
NEC Research Institute, 4 Independence Way, Princeton, NJ
08540
and
Department of Computer Science
Princeton University, Princeton, New Jersey 08544
Email: pny@research.nj.nec.com
Submitted: November 10, 1996
If DNA were a random string over its alphabet {A, C, G, T},
an optimal code would assign 2 bits to each nucleotide. DNA
may be imagined to be a highly ordered, purposeful molecule, and
one might therefore reasonably expect statistical models of its
string representation to produce much lower entropy estimates.
Surprisingly this has not been the case for many natural DNA
sequences, including portions of the human genome. We
introduce a new statistical model (compression algorithm), the
strongest reported to date, for naturally occurring DNA
sequences. Conventional techniques code a nucleotide using
only slightly fewer bits (1.90) than one obtains by relying only on
the frequency statistics of individual nucleotides (1.95). Our
method in some cases increases this gap by more than
five-fold (1.66) and may lead to better performance in
microbiological pattern recognition applications.
One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts whose opinions are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using Expectation Maximization (EM).
Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches are implemented and experimentally compared.
We also report lower entropy estimates for coding regions extracted from a large collection of non-redundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: i) predict the next amino acid based on inexact polypeptide matches, and ii) predict the particular codon. Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of non-redundant coding sequences.
Keywords: DNA, Information Theoretic Entropy, Expectation Maximization (EM), Data Compression, Context Modeling, Time Series Prediction.
The DNA molecule encodes information which by convention is
represented as a symbolic string over the alphabet {A, C, G, T}.
Assuming each character (nucleotide) is drawn uniformly at random from
the alphabet, and that all positions in the string are independent, we
know from elementary information theory
[Cover and Thomas, 1991,Bell et al., 1990] that an optimal code will devote
2 bits to representing each character. This is the maximum entropy
case. DNA may be imagined to be a highly ordered, purposeful molecule, and
one might therefore reasonably expect statistical models of its string
representation to produce much lower entropy estimates, and confirm
our intuition that it is far from random. Unfortunately this has not
been the case for many natural DNA sequences, including portions of
the human genome.
One example is the long human retinoblastoma susceptibility gene
sequence, referred to as HUMRETBLAS.
The alphabet members do not occur with equal frequency in this string,
and accounting for this yields an entropy estimate of roughly 1.95
bits. This is perhaps not surprising since such single character
models are the weakest in our information theory arsenal.
A logical
next step, taken by several investigators, focuses instead on higher
order entropy estimates arising from measurements of the frequencies
of longer sequences. For natural languages (e.g. English) this step
typically leads to significantly lower entropy estimates. But the best
resulting estimate for HUMRETBLAS is roughly 1.9 bits
[Mantegna et al., 1993], still not impressively different from our
2-bit
random starting point.
This may be something of a surprise, since such models reflect such
known DNA structure as composition bias and CG suppression. But
these known effects have little impact on the entropy of DNA sequences
as a whole.
One may view DNA as a time series, and these higher order models
as predictors of the next nucleotide based on its immediate past.
With this interpretation the simple 1.95 bit model relies on none of
the past. The 1.9 bit result is then even more surprising, since it
implies that knowledge of the immediate past reduces the entropy
estimate by a mere 0.05 bits. Data compression techniques such as Lempel-Ziv (LZ)
coding [Ziv and Lempel, 1977] may be
viewed as entropy estimators, with LZ corresponding to a model that
predicts based on a historical context of variable length. It
``compresses'' HUMRETBLAS to more than 2 bits per
character, which is actually worse than
the flat random model we started with. The authors' earlier efforts
implemented other advanced modeling techniques but failed to obtain
estimates below approximately 1.9 bits.
In this paper we describe a new statistical model, the strongest
we are aware of, for naturally occurring DNA sequences. The result is
lower entropy estimates for many sequences. Our
experiments include a wide variety of DNA, and in almost all cases our
model outperforms the best earlier reported results. For HUMRETBLAS, the
result is approximately 1.66 bits per character -- 0.29 bits
improved from our 1.95 bit starting level. The standard higher
order estimate of 1.90 bits is only 0.05 bits lower, so in this
sense our result represents a five-fold improvement. In every case
our model outperforms the most common techniques. These results are
significant because of our model's consistent and frequently substantial advantage
over earlier methods.
So far we have described the quest for lower entropy estimates as a rather pure game, i.e. to predict better an unknown nucleotide based on the rest of the sequence. There are both conceptual and practical reasons to believe that this is an interesting pursuit and that our experimental results are of some importance. Conceptually, the pursuit of better models for data within some domain corresponds rather directly to the discovery of structure in that domain. That is, we model so that we can better understand. Practically, statistical modeling of DNA has proven to be somewhat effective in several problem areas, including the automatic identification of coding regions [Lauc et al., 1992,Cosmi et al., 1990,Cardon and Stormo, 1992,Farach et al., 1995,Krogh et al., 1994], and in classifying unknown sequences into one of a discrete number of classes. The success of such methods ultimately rests on the strength of the models they employ.
In section 2 we discuss the interpretation of entropy estimates obtained by methods such as ours, and by other approaches such as data compressors. The definition of entropy in the context of DNA nucleotide prediction is made more precise, and we discuss entropy estimates for coding regions.
A detailed exposition of our model is presented in section 3, but we first introduce and motivate it here in general terms. Conventional fixed-context models focus on a trailing context, i.e. the immediately preceding nucleotides. Longer contexts reach farther into the past, and might reasonably be expected to result in stronger models. However as context length increases, it becomes increasingly unlikely that a given context has ever been seen. This problem has led to the development of variable length context language models [Bell et al., 1990], which use long contexts when enough earlier observations exist, and otherwise use shorter ones. Since, as noted earlier, the short-term behavior of many natural DNA sequences is so nearly random, these more advanced variable length schemes do not push farther into the past, except in the unusual event of a long exact repeat.
The basic idea behind our model is to push farther into the past by relaxing the exact-match requirement for contexts. Our formalization of this idea in section 3 is one of the main contributions of this paper.
The metric for determining quality of match is primarily a simple Hamming distance, but in the interests of exploring the use of additional biological knowledge, three other metrics were evaluated in section 6. We also describe a two-stage model for coding regions in which an amino acid is first predicted based on earlier amino acids, followed by the prediction of a particular codon. Section 6 concludes with a discussion that suggests why the high entropy levels observed for coding regions may not be so surprising after all.
The term entropy estimate can be somewhat vague. This section
continues our introduction of the term by clarifying its various senses as they
relate to DNA, and its relationship to data compression and
prediction. We include the section in part to answer some of the most
frequently asked basic questions about this work. Our discussion will
first focus on the sense of entropy that corresponds to a
stochastic process. Later, we describe another equivalent sense more suited to
sequences of finite extent. The entropy rate of a stochastic
process $\mathcal{X} = \{X_i\}$ is defined as:

$H(\mathcal{X}) \;=\; \lim_{n \rightarrow \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$    (1)

where this limit need not in general exist, and $H$ denotes
the information theoretic entropy function (see [Cover and Thomas, 1991]).
Given full knowledge of the process, and the limit's existence, the
entropy rate is a well-defined attribute of the process. But we know
very little of the process underlying the generation of natural DNA,
and can merely observe the outcome, i.e. the nucleotide sequence.
Given a predictive model $M$, and a long DNA sequence $X_1, X_2, \ldots$,
we are therefore content to consider:

$\hat{H}_M \;=\; \lim_{n \rightarrow \infty} -\frac{1}{n} \log_2 P_M(X_1, X_2, \ldots, X_n)$    (2)

This may be thought of as the average bits per symbol required to code increasingly long observation sequences. If our model matches Nature perfectly, Nature's process is in some sense well-behaved, and our sequence is of infinite length, then this limit matches the entropy rate. Of course: i) our model does not match Nature, ii) there is no reason to believe Nature is in the required sense well-behaved, and iii) we can observe only finite DNA sequences. It is in this sense that one is computing an entropy estimate when an imperfect model is applied to a finite empirically observed sequence.
Given a data compression program, Equation 2 may
therefore be approximated by dividing the number of bits output by
the number of symbols (nucleotides) input. We will refer to
computations like this as purely compressive entropy estimates.
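To make the computation concrete, here is a minimal Python sketch of a purely compressive estimate. The model interface is hypothetical: `model.predict(context)` is assumed to return a probability for each of A, C, G, and T given everything seen so far.

import math

def compressive_entropy_estimate(sequence, model):
    """Purely compressive estimate: average code length in bits/nucleotide.

    model.predict(context) is assumed (hypothetically) to return a dict mapping
    each of 'A', 'C', 'G', 'T' to its predicted probability given the context."""
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        p = model.predict(sequence[:i])[symbol]   # probability assigned to the actual symbol
        total_bits += -math.log2(p)               # ideal (arithmetic) code length in bits
    return total_bits / len(sequence)             # bits per nucleotide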
Most existing compression algorithms, when applied to natural data, do
in fact frequently approach a limit in bits/character -- their
estimate of the source's entropy. Unless the sequence being
compressed is very long, the purely compressive estimate then
overstates the actual entropy. One might then focus only on the
code-lengths near the end of the sequence. The risk however is then
that the end might just happen to be more predictable and not reflect
the overall entropy. For example, the end might, in the case of DNA,
contain many long exact repeats of earlier segments. This suggests
our second approach, that of cross-validation entropy estimates.
Here the sequence is first partitioned into $k$ equal size segments.
The entropy of each segment is then computed by coding it having first
trained the model using the remaining $k-1$ segments. The reported entropy
estimate is the average of these $k$ estimates. One can imagine that
each of these estimates moves one segment to the end of the sequence,
codes it, and averages code-lengths for the final segment only. As
$k$ increases, the individual cross-validation estimates may be
plotted to visualize the instantaneous entropy rate as a function of
position in the sequence. Several of these plots are presented in
section 5.
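A corresponding sketch of the cross-validation estimate, under the same hypothetical model interface (here with an added fit method for training on the reference segments):

import math

def cross_validation_entropy(sequence, k, make_model):
    """k-way cross-validation entropy estimate in bits/nucleotide.

    make_model() is assumed to return an untrained model with hypothetical
    fit(reference_sequences) and predict(context) methods."""
    n = len(sequence)
    bounds = [i * n // k for i in range(k + 1)]
    segments = [sequence[bounds[i]:bounds[i + 1]] for i in range(k)]
    estimates = []
    for held_out in range(k):
        model = make_model()
        model.fit([s for j, s in enumerate(segments) if j != held_out])
        test = segments[held_out]
        bits = 0.0
        for i, symbol in enumerate(test):
            bits += -math.log2(model.predict(test[:i])[symbol])
        estimates.append(bits / len(test))
    return sum(estimates) / k       # average of the k per-segment estimates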
In what follows, we will switch without harm between a compressive
viewpoint, that of stochastic prediction in which the objective is to
predict the next nucleotide, and generative stochastic modeling in
which one imagines the data to emanate from a particular process.
This is possible because a generative process gives rise to a
prediction, and a prediction can be converted to a code via arithmetic
coding. Starting from a data compressor, we can think of the bits
output for a single input symbol as the negative base-two logarithm of the probability that
a corresponding predictor will predict, or a generator will generate,
that symbol.
The models discussed so far in this paper introduce only one kind of domain
knowledge: that natural DNA sequences include more near
repetitions than one expects at random. Part of the
reason that so little effort has been devoted to more advanced models
that incorporate complex domain knowledge, is that models of the
ignorant variety do a very good job of modeling other natural
sequences. For example, Shannon's work estimated the entropy of
English text to be roughly 1 bit per character (see discussion in
[Bell et al., 1990]). Random text over a 27 letter alphabet (A-Z
and space) corresponds to
$\log_2 27 \approx 4.75$ bits. Today's statistical models,
with no specific knowledge of English syntax or semantics, can achieve
roughly
2 bits.
Returning to the question that opened our discussion, we submit that the essential cause of our surprise is that simple statistical models, which perform so well in other cases, fail to discover significant structure in DNA. Our work demonstrates that DNA is more predictable than had previously been established, but falls far short of the compression levels easily achieved for natural language. This suggests that local structure parallel to ``words'' and ``phrases'' is largely missing in DNA, even when the exact match requirement is relaxed.
As to ``what entropy level we expect and why'', we believe that the question ultimately comes down to one of quantifying the magnitude of the functional degeneracy of polypeptides. That is, how many different sequences are essentially equivalent functionally? If this degeneracy turns out to be large, then DNA is to its function, as an acoustic waveform is to human speech. That is, small local distortions do little to alter the overall message. This is discussed further in Section 6.
In this section we further motivate our model and then describe it in formal terms.
That natural DNA includes near repeats is well known, and in
[Herzel, 1988] the statistics of their occurrence are discussed. We
measure nearness using ordinary Hamming distance,
i.e. the number of positions at which two equal length
strings disagree. Given a target string of length $w$, it is then
natural to wonder how many
$d$-distant matches exist in a string of
length
$n$. As remarked earlier, DNA seems very nearly random over
short distances. Assuming uniform randomness it is easily shown that
we expect roughly $n \binom{w}{d} (3/4)^d (1/4)^{w-d}$ $d$-distant
matches. Natural DNA sequences depart markedly from this behavior,
providing one part of the motivation for our model.
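For reference, a small sketch of the expectation just quoted under the uniform random model; the formula is our reconstruction, and n is treated as the approximate number of candidate windows.

from math import comb

def expected_d_distant_matches(n, w, d):
    """Expected number of windows, among roughly n candidates, lying at Hamming
    distance exactly d from a fixed length-w target, assuming each nucleotide is
    drawn uniformly and independently from a 4-letter alphabet."""
    p_exact_distance = comb(w, d) * (0.25 ** (w - d)) * (0.75 ** d)
    return n * p_exact_distance

# Under this model the expected count of close matches in a long sequence is
# tiny, whereas natural DNA such as HUMRETBLAS contains many near repeats.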
For example
(table 1), consider HUMRETBLAS and a randomly chosen target substring (window)
drawn from one portion of the sequence. Under the random
model one expects essentially no matches at small Hamming distance
in the rest of the sequence; the counts we actually observe, averaged over
all targets of the given length (table 1), are far larger. When such
matches exist, examination of the following
nucleotide yields a prediction that is correct well above chance.
That these matches, when present, lead to good predictions is a second
motivating factor. Unfortunately such small Hamming distance matches
occur for only a small fraction of the positions in HUMRETBLAS, so alone,
this effect is unlikely to lead to much better predictions on average.
But if one again refers to table 1, a much larger fraction of
the positions have an earlier match at a somewhat greater distance -- and even at
this distance, predictions based on past matches remain correct well above chance.
The potential effectiveness of predictions based on
matches at all distances is our final motivating factor. The
objective, then, is to combine these weak information sources to form a
prediction. Also, a version of table 1 may be made
for multiple window lengths. Our model combines all of this information and
makes a single prediction.
The use of near matches for
prediction is certainly consistent with an evolutionary view in which
Nature builds a genome in part by borrowing mutated forms of her
earlier work.
One may view each row of table 1
(really each Hamming distance $h$) as corresponding to a predictive
expert. The prediction of the Hamming
distance $h$ expert is formed
by examining all past matches at exactly this distance from our trailing
context window, and capturing the distribution of following
nucleotides by maintaining a simple table of counters. The simplest
way to combine these experts is by a fixed set of weights that sum to
one.
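A sketch of a single such expert, in Python; this is illustrative code rather than the authors' implementation, and it assumes a pure A/C/G/T alphabet.

def hamming(a, b):
    """Number of positions at which two equal-length strings disagree."""
    return sum(x != y for x, y in zip(a, b))

def expert_prediction(past, context, h):
    """Distribution over the next nucleotide from the Hamming-distance-h expert.

    past is the reference sequence and context the trailing window of length w.
    Returns None if no distance-h match exists (the expert then has no opinion)."""
    w = len(context)
    counts = {c: 0 for c in "ACGT"}
    found = False
    for i in range(len(past) - w):
        if hamming(past[i:i + w], context) == h:
            counts[past[i + w]] += 1       # tally the nucleotide following the match
            found = True
    if not found:
        return None
    total = sum(counts.values())
    return {c: counts[c] / total for c in "ACGT"}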
But suppose that while trying to predict a particular nucleotide
using a window of length $w$, our past experience includes no Hamming distance
0, 1, or 2 matches, i.e. the closest past window is at distance 3. This
information is known before we see the to-be-predicted
nucleotide and may therefore influence the prediction. In particular
it makes no sense to give any weight to the opinions of the first
three experts -- in fact their opinion is not even well-defined in
this case. Only the experts corresponding to Hamming distances
3 through $w$
are relevant. We're then interested in properly weighting them
conditioned on the assumption that the closest match we've seen lies
at distance 3. In what follows we will refer to this value as the
first Hamming distance, $f$. Finally, since we don't know a priori how
window size will influence the prediction, our model is formed at the
uppermost level by a mixture of models, each considering a fixed
window size from some fixed prior set.
We denote by $b$ the discrete random variable representing the
nucleotide (base-pair) to be predicted. Its four values
correspond to A, C, G, and T. By $w$ we denote the
positive integer random variable corresponding to the length of our
trailing context window. The set of values it may assume is a parameter
of the model, defined by specifying non-zero prior probabilities for
only a finite subset of the positive integers. For example, one might allow
only four particular window lengths, so that context windows of just
these four lengths are considered. Next, $f$
denotes the first Hamming distance to be considered. It may
assume values $0, 1, \ldots, w$. By $h$ we denote the index of each
Hamming distance expert. So $h$
also lies in the range
$0, 1, \ldots, w$.
Finally, we use $\mathcal{P}$
to represent our modeling past, i.e. the DNA
that we have either already predicted, or have set aside as reference
information.
Given a fixed window size $w$, and distance $h$, there is a
natural prediction $E_{w,h}(b)$ formed by locating all
distance-$h$
matches in the past to the trailing context of length $w$,
and then using the distribution of nucleotides that follow them.
This is a single expert as described above. This prediction is
independent of $f$,
so $P(b \mid w, f, h, \mathcal{P}) = E_{w,h}(b)$. Our prediction
arises from
the joint probability $P(b, w, f, h \mid \mathcal{P})$ by summing over the
hidden variables $w$, $f$, and $h$
as follows:
$P(b \mid \mathcal{P}) \;=\; \sum_{w} \sum_{f} \sum_{h} P(b, w, f, h \mid \mathcal{P})$    (3)
The final term is then expressed as a product of conditionals:
$P(b, w, f, h \mid \mathcal{P}) \;=\; P(b \mid w, f, h, \mathcal{P}) \, P(h \mid w, f, \mathcal{P}) \, P(f \mid w, \mathcal{P}) \, P(w \mid \mathcal{P})$    (4)
In this expression
$P(b \mid w, f, h, \mathcal{P}) = E_{w,h}(b)$,
and
$P(f \mid w, \mathcal{P}) = 1$ for $f$
equal to the distance of the closest match to our
trailing context window of length $w$
in the past $\mathcal{P}$. At all other
values $P(f \mid w, \mathcal{P}) = 0$. That is,
$P(f \mid w, \mathcal{P})$ is a boolean function
of $f$, $w$, and the past.
Next, we assume $w$
is independent of the past in our model, so
$P(w \mid \mathcal{P}) = P(w)$, and consists of a vector of fixed mixing
coefficients that select a window size. Introducing a dependency
here may improve the model and represents an interesting area for
future work. Finally, in our model $P(h \mid w, f, \mathcal{P})$ is also
independent of the past and consists of a fixed vector $P(h \mid w, f)$ of
mixing coefficients that select a Hamming distance given the earlier
choice of a window size $w$, and the observation that the nearest past
match is at distance $f$. We then have:
$P(b \mid \mathcal{P}) \;=\; \sum_{w} \sum_{f} \sum_{h} E_{w,h}(b) \, P(h \mid w, f) \, P(f \mid w, \mathcal{P}) \, P(w)$    (5)
The first term in the summation is recognized as a single
expert, the second selects an expert based on $w$ and
$f$, the third
deterministically selects a single $f$
value, which receives
probability 1, and the final term selects a window size. The first
and third terms are determined entirely from the current context and
the past; the second and fourth terms are selection parameters that
must be tuned during a learning process.
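Putting the pieces together, the following sketch evaluates the mixture of Equation 5. The containers P_w (window priors) and P_h (conditional expert weights, keyed by the pair (w, f)) correspond to the notation reconstructed above, and the helpers hamming and expert_prediction are those of the earlier sketch; all names are ours, not the authors'.

def first_hamming(past, context):
    """Smallest Hamming distance between the trailing context and any past window."""
    w = len(context)
    return min(hamming(past[i:i + w], context) for i in range(len(past) - w))

def predict(past, recent, P_w, P_h):
    """P(b | past) as in Equation 5, under the reconstructed notation.

    P_w maps each window size w to its prior probability; P_h maps a pair
    (w, f) to a distribution over expert indices h >= f."""
    prediction = {c: 0.0 for c in "ACGT"}
    for w, prior_w in P_w.items():
        context = recent[-w:]
        f = first_hamming(past, context)        # deterministic given w and the past
        weights = P_h.get((w, f), {})
        for h in range(f, w + 1):
            expert = expert_prediction(past, context, h)
            if expert is None:                  # no distance-h match: expert has no opinion
                continue
            for c in "ACGT":
                prediction[c] += prior_w * weights.get(h, 0.0) * expert[c]
    total = sum(prediction.values())
    if total == 0.0:
        return {c: 0.25 for c in "ACGT"}        # fall back to the uniform distribution
    # Renormalize; the full model instead adds a small constant before
    # normalization to guarantee finite code lengths, as discussed later in the text.
    return {c: prediction[c] / total for c in "ACGT"}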
The model may also be viewed generatively (reading from left to right)
as depicted in figure 1. In this example a window
size $w$ is first chosen from a small fixed set. The
probability of each choice is shown beside it. Notice that they, and
all choices at each node in the diagram, sum to unity. Once $w$ is
chosen, the next choice is deterministic: here we've assumed that
the closest match to our trailing context of length $w$ is distance $f$
away. Now given $w$ and $f$ we choose a Hamming distance $h$ with
the probabilities shown.
Finally, given all the earlier choices, we generate a nucleotide
according to the probabilities shown.
[Figure 1: a generative view of the model.]
We refer again to figure 1 to clarify the model's
parameter structure. The probabilities associated with the choice of
$w$ are parameters, as are those associated with the choice of $h$ for
each $w$ and $f$. But recall that the probabilities on each $f$
arc
in the graph are computed and are not therefore parameters. Similarly
the probabilities associated with $b$ are not parameters. They too
are computed and represent the opinion of a single expert as
described earlier. Thus for the figure shown the parameters consist of those
for the choice of $w$ together with those associated with $h$ for each
combination of $w$ and $f$. Other parameters include our selection of window sizes to
consider, and several explained below having to do with the learning
algorithm.
The learning task before us is to estimate these parameters by examining the training set. Our algorithm is an application of the Baum-Welch algorithm for Hidden Markov Models [Baum and Eagon, 1967,Baum et al., 1970,Baum, 1972,Poritz, 1988], and may also be viewed as an instance of Expectation Maximization (EM), a later rediscovery [Dempster et al., 1977,Redner and Walker, 1984] of essentially the same algorithm and underlying information theoretic inequality.
The Baum-Welch algorithm is an optimization technique which locally maximizes the probability of generating the training set from the model by adjusting parameters. A global optimization method would be expected to yield better results in general. In the general case, however, the problem of globally optimizing directed acyclic graph-based models is known to be NP-complete (see [Yianilos, 1997, p. 32]). Our model fits within this class but we have not considered the complexity of its optimization. It is known, however, to exhibit multiple local optima.
In the case of our model, the parameters are
the probabilities $P(w)$ associated with the choice of window size $w$,
and the probabilities $P(h \mid w, f)$
associated with the choice of expert $h$ given $w$ and $f$.
Before learning begins, these parameters are
initialized in our implementation to the flat (uniform) distribution.
Examples from the training set are now presented repeatedly, and these
parameters are reestimated upon each presentation -- climbing away
from their starting values towards some set of values that locally
maximize the log likelihood of the training set.
The training set consists of a series of prediction problems: predict
a given nucleotide using its trailing context and some past
reference DNA. For the experiments of this paper, we select positions
to be predicted at random from the training DNA. The rest of the
sequence then serves as reference when searching for near-matches.
Because of the simple tree-like branching structure of our model, the
reestimation formulae are particularly simple and natural.
Accumulated expectations are first computed by:

$c(w) \;=\; \sum_{t} P\big(w \mid b_t, \mathcal{P}_t\big)$    (6)

where the probabilities are assumed to be conditioned on the model's current parameter set, and the summation is over our randomly chosen positions $t$ to be predicted. The reestimated window selection probabilities are then:

$\hat{P}(w) \;\propto\; c(w)$    (7)

Expectations for $h$ are accumulated according to:

$c(h \mid w, f) \;=\; \sum_{t \,:\, f_t(w) = f} P\big(w, h \mid b_t, \mathcal{P}_t\big)$    (8)

where $f_t(w)$ denotes the first Hamming distance observed at position $t$ for window size $w$. The reestimated conditional Hamming distance selection probabilities are then:

$\hat{P}(h \mid w, f) \;\propto\; c(h \mid w, f)$    (9)
This new parameter set
must increase
the likelihood of the training set under the model. Performing the
steps above once corresponds to one EM iteration. Each
iteration will
continue to improve the model unless a local maximum is reached. In
section 5 we will study the effect of these
iterations on the performance of our model, and also examine the
model's sensitivity to other parameters.
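The following Python sketch shows one such iteration under our reading of Equations 6-9. The per-position entries "first" and "posterior" are hypothetical containers for the first Hamming distances and the posteriors computed under the current parameters; the small constant added before normalization anticipates the practical expedient discussed next.

def em_iteration(events, P_w, P_h, eps=1e-6):
    """One EM reestimation step (a sketch of our reading of Equations 6-9).

    Each element of events describes one randomly chosen training position as a
    dict with two (hypothetical) entries:
      "first"     -- maps window size w to the first Hamming distance seen there
      "posterior" -- maps (w, h) to P(w, h | b, past) under the current parameters."""
    # Expectation step: accumulate expected counts (Equations 6 and 8).
    count_w = {w: 0.0 for w in P_w}
    count_h = {}                                    # keyed by (w, f, h)
    for event in events:
        for (w, h), p in event["posterior"].items():
            count_w[w] += p
            key = (w, event["first"][w], h)
            count_h[key] = count_h.get(key, 0.0) + p
    # Maximization step: renormalize, adding a small constant (Equations 7 and 9).
    total = sum(count_w.values()) + eps * len(count_w)
    new_P_w = {w: (c + eps) / total for w, c in count_w.items()}
    new_P_h = {}
    for (w, f) in {(w, f) for (w, f, _) in count_h}:
        group = {h: c for (ww, ff, h), c in count_h.items() if (ww, ff) == (w, f)}
        norm = sum(group.values()) + eps * len(group)
        new_P_h[(w, f)] = {h: (c + eps) / norm for h, c in group.items()}
    return new_P_w, new_P_h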
At the conclusion of each reestimation step above, the reestimated vector is
normalized to sum to unity. This operation is also performed when an
expert converts counts derived from earlier matches into a
stochastic prediction for the next nucleotide. In both places it is
possible for one or even all vector locations to be zero. The result
is then one or more zero probabilities after normalization. If the
entire vector is zero, the normalization operation is of course no
longer even well-defined. To ensure that our system deals only with
finite code-lengths, and to handle the all-zero case simply, we, as a
practical expedient, add a small constant, $\epsilon$, to each
vector position before normalization. We stress that this is not done
for the same reason that one might use Laplace's rule (add one to
event counts) to flatten the prediction of a simple frequency-based
model. Such techniques are basically a remedy for data sparseness.
We require no such flattening since our model is itself a mixture of
many experts, including some for which plenty of data is
available. Experiments confirm our intuition that additional
Laplace-style flattening is in fact harmful.
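The normalization just described might look as follows; this is a sketch, and the value of the constant is illustrative.

def normalize(counts, eps=1e-6):
    """Turn a vector of non-negative counts or expectations into probabilities.

    The small constant eps is added to every entry before normalization so that
    zero entries (or an all-zero vector) still yield finite code lengths.  As the
    text stresses, this is a practical expedient rather than Laplace smoothing."""
    smoothed = {k: v + eps for k, v in counts.items()}
    total = sum(smoothed.values())
    return {k: v / total for k, v in smoothed.items()}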
We conclude by sketching the operation of a fully on-line model. Rather than performing EM iterations on a training set, expectations are accumulated as each new nucleotide is predicted. The model is then updated via the normalization (maximization) steps above before the next character is processed. To deal with data sparsity early on, normalization operations employ more aggressive flattening than above, but the constant employed decays over time to some very small value such as that above. The mathematical behavior of this algorithm is not as well understood as that of conventional offline training, and we will therefore study it independently in our future work.
The DNA sequences were chosen to pass the following criteria: sufficient length to support this type of entropy estimation method, inclusion of a wide variety of species and sequence types to evaluate the generality of the method, and inclusion of sequences used to benchmark other published methods. All sequences are from GenBank Release No. 92.0 of 18 December 1995. They are: thirteen mammalian genes (GenBank loci HSMHCAPG, HUMGHCSA, HUMHBB, HUMHDABCD, HUMHPRTB, HUMMMDBC, HUMNEUROF, HUMRETBLAS, HUMTCRADC, HUMVITDBP, MMBGCXD, MUSTCRA, RATCRYG), a long mammalian intron (HUMDYSTROP), a C. elegans cosmid (CEC07A9), seven prokaryote genes (BSGENR, ECO110K, ECOHU47, ECOUW82, ECOUW85, ECOUW87, ECOUW89), a yeast chromosome (SCCHRIII), a plant chloroplast genome (CHNTXX), two mitochondrial genomes (MPOMTCG, PANMTPACGA), five eukaryotic viral genomes (ASFV55KB, EBV, HE1CG, HEHCMVCG, VACCG), and three bacteriophage genomes (LAMCG, MLCGA, T7CG).
There has been considerable discussion in the literature of the statistical differences between coding and noncoding regions. To support our study of this issue we assembled several long strands consisting of purely coding or noncoding DNA. These sequences were formed by concatenating coding or non-coding regions from one or more of the above DNA sequences. We consider a region to be coding if either the sense or anti-sense strand is translated; otherwise it is non-coding.
To gather a larger body of coding regions, we obtained a data set
of 490 complete human genes. This data set was screened to remove any
partial genes, pseudogenes, mutants, copies, or variants of the same
gene [Noordewier, 1996]. The resulting collection
is referred to as our non-redundant data set.
Our model's performance on the sequences described in
section 4 is summarized in table 2. In
some cases our results may be compared directly with estimates from
[Grumbach and Tahi, 1994], which are included in the table. Our values for
the 4-symbol entropy may be compared with the
redundancy estimates of [Mantegna et al., 1993] and are in agreement. We
have grouped our results by general type (i.e. mammalian, prokaryote,
etc.).
The multigram columns contain conventional multigram entropy
estimates. The CDNA column reports our model's cross-validation
entropy estimates. Compressive estimates from the BIOCOMPRESS-2
program of [Grumbach and Tahi, 1994] are contained in the following column.
Our model's compressive estimates are given in the table's final
column, CDNA-compress.
The compressive estimates are generated by partitioning the sequence
into 20 equal segments
$s_1, s_2, \ldots, s_{20}$. The entropy estimate
for $s_i$
is denoted
$h_i$
bits/nucleotide.
$h_2$ is calculated using
CDNA training on $s_1$
and testing on $s_2$;
$h_3$ is
calculated from CDNA trained on $s_1$
and $s_2$,
tested on $s_3$,
and so forth. The overall estimate for CDNA-compress is
the average of $h_2, \ldots, h_{20}$.
The reported CDNA and CDNA-compress results depend on several model parameters. Their values and a sensitivity analysis are given later in this section. In the mammalian group, ``14 Sequences'', and the similarly named lines elsewhere in the table, represent the average of the entropy estimates for all relevant individual sequences, weighted by sequence length.
The conventional multigram entropy results are in general not too informative.
It is interesting however to note the variation among sequences in even
the simplest statistic, the distribution of individual nucleotides. Only two of these estimates
are notably low: those
for HUMGHCSA and PANMTPACGA. In
the first case CDNA reports a far lower figure still. This is a very repetitive sequence
and CDNA is able to exploit this to produce the lowest entropy estimate
we observed for any sequence. In the case of PANMTPACGA, nearly all of the
redundancy is explained by the unigraph statistics.
The observed relationship between
base composition and gene density [Dujon et al., 1994] in yeast is reflected in the discrepancy
in the single-nucleotide statistics between the coding and non-coding regions of yeast chromosome
III.
Turning to the higher order estimates, observe that they are rarely better than the
lower order ones, and in some cases
are markedly worse (a consequence of limited sequence length).
As a point of comparison, entropy rate was also estimated simply by compressing the ASCII representation of the sequence using the UNIX compress utility and dividing the length of the compressed sequence by the length of the uncompressed sequence. Because of its limited dictionary size, the utility cannot be expected to converge to a good entropy estimate for long, complex sequences. For shorter sequences (roughly, sequences not greatly longer than 32,000 nucleotides), this limitation should have no impact. Notice that in all cases, the UNIX compress utility performs worse than no model at all (uniformly random prediction).
To make comparisons among entropy estimation methods clearer,
figure 2 summarizes the results from
table 2 for CDNA, the conventional multigram estimates, BIOCOMPRESS-2, and
CDNA-compress.
In all cases CDNA outperforms conventional entropy estimates,
and in almost all cases by a substantial margin. In many cases the
compressive estimates from CDNA
(i.e., CDNA-compress) are lower than those of
BIOCOMPRESS-2.
In several they are comparable, and
in only one case (VACCG) does BIOCOMPRESS-2 outperform CDNA-compress by as much as 0.05 bits.
It is possible that
even this case
is due to the offline nature of the current CDNA program which as
mentioned above forces us to compute compressive estimates by dividing
the sequence into large segments.
Finally, we remark that while the
table contains results for HUMRETBLAS, this sequence was used in
our parameter sensitivity study below.
The CDNA program's behavior is parameterized by several values.
The distance measure is discussed later in
section 6 and for all other experiments is fixed
as described earlier. We used 3-way cross-validation to assess
parameter sensitivity (see discussion below), and 8-way
cross-validation for all experiments. Test samples of a fixed size were
drawn at random from each cross-validation segment.
To assess the variation due to sampling, we performed several
cross-validation estimates using different random seeds and examined the
mean and standard deviation of the resulting entropy estimates for
HUMRETBLAS. Larger samples would result in
smaller variance at the expense of computer time.
Figure 3 shows the effect of EM iterations on the
entropy estimate for both training and test sets. This figure and the
two that follow employ HUMRETBLAS. Based on this analysis, we
fixed the number of EM iterations used in our experiments. Notice that there is
little overtraining: both train and test performance are
essentially constant after a modest number of
iterations, and their gap is small.
Figure 4 shows the effect of a single window size on
performance. Beyond a certain window size,
overtraining is evident. This is to be
expected since the number of parameters grows quadratically with window
size. For our experiments we selected a mixture of even window
sizes. Odd values were excluded only to reduce memory
requirements for the model. Additional experiments indicated that
while mixtures of different window sizes did in general help, the
benefit was somewhat small. We did not study this for all
sequences and therefore employed a mixture to ensure that the model is
as general as possible (in theory the mixture can only help).
Finally, figure 5 shows the sensitivity of
performance to training set size. The training set size used
for our experiments was chosen because beyond that point, little
improvement was evident: train and test performance had essentially
converged.
Our results so far have consisted of entropy estimates for sequences of a single type. We now present the results of several cross-entropy experiments. By this we mean that the type of the training set is different from that of the test set. In table 3 mammalian coding regions and non-coding regions form the two types. A substantial cross-entropic gap is apparent. That is, a model trained on the coding sequence might be used to distinguish coding from non-coding based on entropy. The same is true when one trains on the non-coding sequence.
The same experiment produces very different results for yeast as shown in table 4. While a large gap exists when one trains using the non-coding regions, a reverse gap exists in the other case. That is, training on the coding regions assigns shorter codes to the non-coding regions than to itself. While this effect is weak, it suggests that sequences from the coding region may be echoed in the non-coding region.
Cross-species entropy estimates are interesting in general, but our
one example shown in table 5 shows little
interesting structure. That is, yeast coding regions teach nothing
about human coding regions and visa versa. In fact, the entropy
estimates are in both cases worse than random. In the non-coding
case, a weak relationship exists, but is easily explained by agreement
on the low order multigraphic statistics of the sequences, e.g.
.
The entropy estimates we have discussed reduce the model's performance on a sequence to a single number. That number represents the average code-length but says nothing about the distribution of code-lengths. The histograms of figures 6, 7, and 8 depict this distribution for the EBV virus, the human gene HUMRETBLAS, and the yeast chromosome SCCHRIII. They present an interesting visual signature of a sequence. Trimodality is apparent in all three, but is present to a varying extent.
The first mode, most prevalent in our viral example, corresponds to
near-zero code-lengths that correspond to long exact or near-exact
repeats. In the viral case, this peak is particularly narrow because
while the sequence is known to contain several long exact repeats, it
lacks the full range of near repeats of various lengths found in
sequences such as HUMRETBLAS. The middle mode corresponds to
nucleotides having a long context that matches well with some other
segments from the sequence. In this case short codes are assigned so
long as these matches predict well the following nucleotide. The
rightmost mode, particularly pronounced in SCCHRIII and less so
in HUMRETBLAS, corresponds to codes well over 2 bits. Here the
model had a strong prediction, but was wrong.
The distribution of code-lengths displayed in these histograms says
nothing about how code-lengths are distributed with respect to
position in the DNA sequence. To visualize this we have plotted in
figures 9 and 10 the
code-lengths at random positions for HUMRETBLAS and SCCHRIII. It is apparent that the distribution is not homogeneous
and presents another interesting visual signature of the sequence.
The distribution for HUMRETBLAS is shown at
a closer scale (figure 11) where finer structure
is revealed.
The modality noted in the histograms is also apparent
in these plots.
Our model exploits near matches to form a prediction, and its performance therefore depends on precisely what is meant by near, i.e., the distance metric used to judge string similarity. In this section we begin by discussing four metrics and report on their performance.
The simplest notion of string distance relevant to our domain is simple Hamming distance: the number of mismatching nucleotides in two strings of equal length. We will refer to this as the nucleotide-sense metric. As mentioned earlier, our domain knowledge suggests that we also compare the strings after reversing one, and complementing its bases. Both distances are computed and the smaller one used. This is our nucleotide-both metric, and was used in all experiments from earlier sections.
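A sketch of these two nucleotide-level metrics (our own code, not the authors'):

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def hamming(a, b):
    """Number of mismatching positions in two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def reverse_complement(s):
    return "".join(COMPLEMENT[c] for c in reversed(s))

def nucleotide_sense(a, b):
    return hamming(a, b)

def nucleotide_both(a, b):
    # Distance to b itself, or to its reversed and complemented form, whichever is smaller.
    return min(hamming(a, b), hamming(a, reverse_complement(b)))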
Coding regions have additional structure which we capture in
another pair of metrics: amino-sense and
amino-both.
A common misconception is that coding regions should compress well
because three nucleotides together (64 possibilities) code for only
one of 20 amino acids. This reasoning is flawed because 61
of the 64 possibilities represent valid codes for amino acids. Only 3
codons are used to stop translation. Thus the maximum entropy
level for coding regions is not
$\log_2(20)/3 \approx 1.44$ bits, nor 2
bits as for random unconstrained sequences, but rather is
$\log_2(61)/3 \approx 1.98$
bits. In fact, it has been observed by several
authors that coding regions are less compressible than non-coding
regions
(e.g., [Mantegna et al., 1993,Salamon and Konopka, 1992,Farach et al., 1995]).
So it is clear that two sequences that code for
the same polypeptide may nevertheless have large Hamming
distance. The amino-sense distance between two strings whose
length is a multiple of 3, and that are assumed to be
aligned with the reading frame, is straightforward: it is
simply the number of mismatches in the amino acid sequence
coded by the nucleotide sequence; where the stop signal is
considered to be a 21st amino acid.
We extend this definition to the case in which the strings being compared are not necessarily a multiple of 3 in length, and we drop the assumption that they are aligned with the reading frame. The comparison does however assume that the strings are similarly aligned. If the first string begins one base into the reading frame, it is assumed that the other string does as well. That is, the strings must be in phase. This assumption is appropriate because our search for near matches will consider all possible positions, and in so doing all relative phases.
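A sketch of the amino-sense comparison under these assumptions. It leans on Biopython's standard-table translation (assumed to be available); how partial codons at the ends were treated in the original implementation is not specified, so they are simply dropped here. The amino-both variant would additionally compare against the reversed, complemented string and keep the smaller distance.

from Bio.Seq import Seq   # assumes Biopython is installed

def amino_sense(a, b, phase=0):
    """Amino-acid-level mismatches between two in-phase nucleotide strings.

    phase is the offset of the first base within its reading frame (0, 1, or 2);
    both strings are assumed to share it.  Leading and trailing partial codons
    are ignored for simplicity; the stop signal ('*') simply counts as a 21st
    symbol, as in the text."""
    start = (3 - phase) % 3                        # skip to the first complete codon
    n = ((min(len(a), len(b)) - start) // 3) * 3   # whole codons only
    aa_a = str(Seq(a[start:start + n]).translate())
    aa_b = str(Seq(b[start:start + n]).translate())
    return sum(x != y for x, y in zip(aa_a, aa_b))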
Selected experiments comparing these four metrics are shown in table 6. Entire sequences as well as their noncoding and coding regions were considered. The nucleotide-both metric performs best, and was therefore used for the experiments reported in earlier sections.
The amino metrics never outperformed their nucleotide counterparts. This is not so surprising for entire sequences, or noncoding regions, but we expected the amino metrics might perform better for coding regions. In fact their estimates are somewhat worse. Their poor performance is explained by the fact that they distinguish fewer discrete distances, and are always blind to the exact nucleotide structure of the sequence, even in the positions immediately before the base to be predicted.
The motivation in designing these metrics was the possibility that coding regions might contain long nearly identical amino acid sequences, which despite their nearness when expressed as an amino acid sequence, are far apart when coded as nucleotides. As explained above, our first attempt to capture this domain knowledge in the model was unsuccessful, so we tried another approach that produced a slightly stronger model for coding regions.
Our second approach separates the prediction problem into two
stages. An amino acid is first predicted using CDNA based on
inexact matches, where the sequence is viewed at the amino
acid level, i.e., has alphabet size 21. Unlike our earlier
models, which make no assumption as to the nature of the DNA,
this model assumes that its input is a sequence of legal
codons. The model's second stage predicts
a particular codon, conditioned on the already predicted amino acid,
and on earlier positions which are now viewed at the nucleotide level.
The codon entropy estimate is then the sum of the entropies of these
two stages, and we write:
$H_{\mathrm{codon}} \;=\; H_{\mathrm{amino}} + H_{\mathrm{codon} \mid \mathrm{amino}}$    (10)
Dividing by three then gives a per-nucleotide entropy
which may be compared with earlier results. Table 7
gives the results of several models on our non-redundant data set.
The first line of table 7 provides, for
reference, the entropy corresponding to a uniformly random
sequence of codons. The next line reports the standard
multigraphic entropy where the alphabet is nucleotides
(the best multigram order was used), giving 1.92 bits. Line three reports the result of the
CDNA program operating on a sequence of
nucleotides. Its performance, 1.91 bits, is
a mere 0.01 bits better than
the multigraphic estimate. The fourth line
reports the results from a two-stage model where the first
stage predicts an amino acid using a simple 2-gram model
(i.e., frequencies of amino acid pairs). The second stage
predicts a codon, based on the predicted amino acid, and the
two preceding nucleotides. In line 5, the 2-gram model is
replaced with the CDNA program where the alphabet is now
amino acids, and windows spanning a range of lengths (in amino acids) are
used. As for earlier experiments, the result on line five
arises from 8-way cross validation. Therefore, nearby matches
will not frequently contribute to a prediction. Line six
employs what amounts to maximal cross validation, where
a sequence element is predicted using everything else in the
sequence. The same technique was used to produce figures
9-11.
It is apparent that our non-redundant collection of coding
material is harder to model than the entire genes considered
in earlier sections. Nevertheless we are able to improve the
multigraphic estimate somewhat, to 1.84-1.87 bits. It is
important to note that our non-redundant data set is not a random
sample, but rather is closer to the worst case. We
expect that lower entropy estimates would result from random
samples of similar size, and remark that this was the case when
we considered Mammals/Coding-only in table 2, where we reported
a lower estimate using CDNA at the
nucleotide level.
It is possible that the addition of more domain knowledge will
produce lower entropy estimates for coding regions. However,
we believe that the actual entropy may be not so far below our
current estimates. This is consistent with the view that the
coding regions of DNA commonly amount to a highly random
string, with only isolated critical regions.
If much of large proteins amounts to
scaffolding, and the bulk of proteins in a genome are
large, then we would expect large entropies. By scaffolding we mean
weakly constrained structure necessary only to ensure that
the protein folds correctly, and that particular active regions
are positioned properly in the resulting molecule.
There are certainly many genes that code for somewhat small proteins, but since
entropy is a per-base expected value, the genes that code for large
proteins have a proportionally greater effect on our estimate.
Figure 12 shows that in our non-redundant collection of
human genes, fewer than one third of the bases are contained in the
shorter coding regions.
We conclude with a crude analysis to
quantify our belief. If only the hydrophobic/hydrophilic nature of
an amino acid were critical, then at most one bit would be required to
make an acceptable selection. This is then one third of a bit per base.
If, except for this structure, the string is random, then the lowest
entropy we would expect is roughly
$(\log_2 61 - 1)/3 \approx 1.6$ bits per base.
Biological experiments to quantify the sensitivity of proteins of
various length to mutation, would then shed light on the actual entropy of
coding regions.
We have shown that the near repeats in natural DNA sequences may be incorporated into a statistical model resulting in significantly lower entropy estimates. For some sequences our model is the first to detect substantial deviations from random behavior and illustrates the importance of inexact string matching techniques in this field. It is entirely possible that very different results will be obtained, particularly for coding regions, when much more DNA is available for analysis.
It should be noted that conventional entropy estimates already include the known
effect of base composition
[Gatlin, 1972] on entropy estimation, and BIOCOMPRESS-2
includes the similarly well-known effect of long exact repeats and exact complement
repeats [Herzel et al., 1994].
CDNA's generally superior performance indicates that
DNA possesses more structure which may be exploited.
Several notions of distance were evaluated and the best performance resulted from considering both Hamming distance to reversed and complemented targets, as well as standard Hamming distance, then selecting the minimum. We also described two-stage models crafted especially for coding regions, which perform better than our basic model on coding sequences.
Because of the observations in Section 2, a data compressor may be used as a
classifier, and the result interpreted in probabilistic terms. We
mention this because classification is one of the applications
motivating our work. To illustrate, suppose one has two long
sequences of DNA, the first, $a$, of some type $A$ and the other, $b$, of
type $B$. The task is to classify a third strand $x$. Let $C(s)$
denote the number of bits output by the compressor when fed sequence
$s$, and let $\cdot$ denote string concatenation. Then $C(a \cdot x) - C(a)$ may by our
remarks above be interpreted as $-\log_2 P(x \mid A)$. Similarly
$C(b \cdot x) - C(b)$ corresponds to $-\log_2 P(x \mid B)$. At this point one
might simply compare these quantities to effect classification, or exponentiate them,
yielding probabilities. These can then be combined along with prior
class probabilities, resulting in a Bayesian classifier built from data
compressors.
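A sketch of such a classifier, treating the compressor as a black box; compressed_bits is a hypothetical callable returning C(s) for a sequence s.

import math

def classify(x, a, b, compressed_bits, prior_a=0.5):
    """Classify strand x as type A or type B using a data compressor.

    compressed_bits(s) is assumed to return C(s), the number of bits the
    compressor emits for sequence s; a and b are long reference sequences of
    the two types.  C(a + x) - C(a) is read as -log2 P(x | A), and likewise for B."""
    bits_given_a = compressed_bits(a + x) - compressed_bits(a)
    bits_given_b = compressed_bits(b + x) - compressed_bits(b)
    score_a = math.log2(prior_a) - bits_given_a        # unnormalized log posterior for A
    score_b = math.log2(1.0 - prior_a) - bits_given_b  # unnormalized log posterior for B
    return "A" if score_a > score_b else "B"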
Future work will consider stronger models exploiting additional effects. The current assumption that each prediction event is independent of the last may be dropped and a temporal dependence component added to the model. Next the metric may be enhanced to notice not only the number of mismatches, but also where they occur. Other structural changes are possible and the inclusion of additional domain knowledge in some form is a particularly interesting direction. Also, there are entirely different approaches one might take to the problem of combining experts.
Because of our interest in alternative metrics, simple linear search is used to find near matches. Our results indicate that standard Hamming distance performs well, so we plan to implement more advanced sublinear algorithms for this purpose.
Finally, we plan to investigate the application of these new, stronger models to various applications in the microbiological field.
http://paul.rutgers.edu/~loewenst/cdna.html.