CDNA is a cross-entropy estimator that reads a training file and a test file and computes a cross-entropy for the test file conditioned on the training file. The CDNA algorithm is described in Loewenstern and Yianilos, Significantly lower entropy estimates for natural DNA sequences, pages 151-160, Proceedings of the Data Compression Conference, 1997.

Usage: cdna <#smp1> <#smp2> <#em> [...]

  , ... ASCII files containing the sequences to train and test -- consisting of characters and whitespace.

  <#smp1/2> ... Integer giving the number of random samples to select from the training/test file. <#smp1> limits time and memory for training. <#smp2> limits test time.

  ... Integer providing the random seed.

  <#em> ... The number of EM iterations to perform.

  ... The syntax is a/n:f/m. Specifying a:m corresponds to amino-hamming-distance with multiple directions, i.e. forward as well as reverse-complement matching attempts. Specifying n:f corresponds to nucleotide-hamming-distance with forward matching only. All four combinations may be used: a:f, a:m, n:f, n:m. Note: a:f and a:m may only be used with the dna.alp alphabet file.

  ... Each such term specifies context sizes to be used in the model. Each term may be given in one of three forms: "%d", "%d-%d", or "%d-%d:%d". The first form specifies a single context length, the second a range, and the third a range with an increment.

  ... The first line gives the alphabet size. The characters beginning on the following line define the alphabet. Any byte value is legal.

Note: There is built-in support for cross-validation. If ... is given as '@8:3' then the training file is divided into 8 roughly equal segments. The third is deleted and becomes the test set.

Copyright 1997. Please contact the authors for permission to use or distribute this data. We can be reached at:

  David Loewenstern
  Department of Computer Science
  Rutgers University, Busch Campus
  Piscataway, NJ 08855

  Peter N. Yianilos
  NEC Research Institute
  4 Independence Way
  Princeton, NJ 08540
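
Illustration of the context-size terms: the three forms "%d", "%d-%d", and "%d-%d:%d" described above can be expanded into explicit context lengths as in the Python sketch below. This is not code from cdna itself. The function names are hypothetical, and two points are assumptions the text does not settle: that the upper end of a range is inclusive, and that a range without an increment steps by 1.

```python
def expand_context_term(term):
    """Expand one context-size term into a list of context lengths.

    Accepted forms (from the usage text): "N", "LO-HI", "LO-HI:STEP".
    ASSUMPTIONS (not stated in the CDNA text): HI is inclusive, and the
    step defaults to 1 when no increment is given.
    """
    range_part, colon, step_part = term.partition(":")
    low_text, dash, high_text = range_part.partition("-")
    low = int(low_text)

    if not dash:
        # Single context length; an increment makes no sense here.
        if colon:
            raise ValueError("increment given without a range: %r" % term)
        return [low]

    high = int(high_text)
    step = int(step_part) if colon else 1
    if step <= 0 or high < low:
        raise ValueError("invalid range or increment: %r" % term)
    return list(range(low, high + 1, step))


def expand_context_terms(terms):
    """Expand several terms, keeping the order in which they were given."""
    sizes = []
    for term in terms:
        sizes.extend(expand_context_term(term))
    return sizes


print(expand_context_term("12"))      # [12]
print(expand_context_term("8-12"))    # [8, 9, 10, 11, 12]
print(expand_context_term("8-16:4"))  # [8, 12, 16]
```

Under these assumptions, the terms "10" and "12-16:2" together select context lengths 10, 12, 14, and 16.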