DNA Sequence Classification using Compression-Based Induction

David Loewenstern and Haym Hirsh and Peter Yianilos and Michiel Noordewier

Abstract: Inductive learning methods, such as neural networks and decision trees, have become a popular approach to developing DNA sequence identification tools. Such methods attempt to form models of a collection of training data that can be used to predict future data accurately. The common approach to using such methods on DNA sequence identification problems forms models that depend on the absolute locations of nucleotides and assume independence of consecutive nucleotide locations. This paper describes a new class of learning methods, called compression-based induction (CBI), that is geared towards sequence learning problems such as those that arise when learning DNA sequences. The central idea is to use text compression techniques on DNA sequences as the means for generalizing from sample sequences. The resulting methods form models that are based on the more important relative locations of nucleotides and on the dependence of consecutive locations. They also provide a suitable framework for injecting biological domain knowledge into the learning process. We present initial explorations of a range of CBI methods that demonstrate the potential of these methods for DNA sequence identification tasks.
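
To make the central idea concrete, the following is a minimal sketch of classification by compressed size. It is not the method from the paper: it assumes a simple order-k context model with add-one smoothing as a stand-in compressor, and measures the ideal code length (in bits) that such a model would assign to a sequence. The function names (`train`, `code_length`, `classify`), the class labels, and the toy training strings are all hypothetical and chosen only for illustration. A query sequence is assigned to the class whose model encodes it in the fewest bits.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def train(sequences, k=2):
    """Count which nucleotide follows each length-k context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts


def code_length(model, seq, k=2):
    """Ideal code length of seq in bits under the model (add-one smoothing).

    The first k symbols only serve as context and are not charged.
    """
    bits = 0.0
    for i in range(k, len(seq)):
        followers = model.get(seq[i - k:i], {})
        total = sum(followers.values())
        p = (followers.get(seq[i], 0) + 1) / (total + len(ALPHABET))
        bits -= math.log2(p)
    return bits


def classify(models, seq, k=2):
    """Return the label whose model encodes seq in the fewest bits."""
    return min(models, key=lambda label: code_length(models[label], seq, k))


# Toy training data, invented for illustration only.
models = {
    "at_repeat": train(["TATATATATATATATA"]),
    "gc_repeat": train(["GCGCGCGCGCGCGCGC"]),
}
print(classify(models, "ATATATAT"))
print(classify(models, "CGCGCGCG"))
```

In this sketch each prediction depends only on the preceding k nucleotides, not on where they occur in the sequence, which is one simple way to reflect the abstract's emphasis on relative locations and on the dependence of consecutive positions.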

Links:

- The complete paper (PDF)
- The complete paper (PostScript)
- BibTeX entry
- My homepage (Peter N. Yianilos)
See also:
- Significantly lower entropy estimates for natural DNA sequences
- The ANSI-C CDNA Software and Dataset
Coauthors:
- David Loewenstern
- Haym Hirsh
- Michiel Noordewier

This manuscript also appeared as an NEC Research Institute Technical Report.