Department of Biochemistry, University of Washington, Seattle, WA 98195
Locating protein-coding regions in genomic DNA is a critical step in using the information generated by large-scale sequencing projects. Current methods for gene recognition rely primarily on differences in codon frequencies between coding and non-coding DNA, splice site detection, and homology to previously identified proteins. We propose that recurrent amino acid sequence patterns 3-19 amino acids in length are a powerful addition to the "content statistics" used in current gene finding approaches. Our group has developed a set of sequence patterns, several of them new, that are known to correlate strongly with local protein structures.
A finite mixture model based on these patterns is trained using Expectation Maximization. The model is shown to partially discriminate coding sequences that have no detectable homology to known proteins from randomized versions of the same sequences, while it finds virtually no features in short (fewer than 50 amino acids) open reading frames extracted from the S. cerevisiae genome. The effects of low-complexity sequences in both coding and non-coding open reading frames will be discussed. We anticipate that adding a module that detects these patterns is likely to improve the performance of currently used gene recognition methods.
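As an illustration of the training step, the following is a minimal sketch of Expectation Maximization for a finite mixture over fixed-width amino acid windows. The abstract does not specify the form of the model, so everything here is an assumption made for illustration: each mixture component is taken to be a set of position-specific residue probabilities, the window width is fixed at 5 (within the 3-19 range stated above), pseudocounts are used for smoothing, and all function names and the example sequences are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def windows(seq, width):
    """Overlapping windows of the given width that contain only standard residues."""
    wins = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    return [w for w in wins if all(a in AMINO_ACIDS for a in w)]

def init_components(k, width, rng):
    """Random position-specific residue probabilities for k components (assumed model form)."""
    comps = []
    for _ in range(k):
        comp = []
        for _ in range(width):
            raw = [rng.random() + 0.5 for _ in AMINO_ACIDS]
            total = sum(raw)
            comp.append({a: r / total for a, r in zip(AMINO_ACIDS, raw)})
        comps.append(comp)
    return comps

def window_likelihood(win, comp):
    """Probability of one window under one component (positions treated as independent)."""
    p = 1.0
    for pos, aa in enumerate(win):
        p *= comp[pos][aa]
    return p

def em_step(wins, weights, comps, pseudo=0.1):
    """One EM iteration; returns updated weights, components, and the log-likelihood
    of the data under the parameters passed in."""
    k, width = len(comps), len(comps[0])
    counts = [[{a: pseudo for a in AMINO_ACIDS} for _ in range(width)] for _ in range(k)]
    resp_totals = [0.0] * k
    loglik = 0.0
    # E-step: responsibility of each component for each window.
    for win in wins:
        joint = [weights[j] * window_likelihood(win, comps[j]) for j in range(k)]
        total = sum(joint)
        loglik += math.log(total)
        for j in range(k):
            r = joint[j] / total
            resp_totals[j] += r
            for pos, aa in enumerate(win):
                counts[j][pos][aa] += r
    # M-step: re-estimate mixing weights and smoothed residue probabilities.
    new_weights = [t / len(wins) for t in resp_totals]
    new_comps = []
    for comp_counts in counts:
        comp = []
        for col in comp_counts:
            col_total = sum(col.values())
            comp.append({a: c / col_total for a, c in col.items()})
        new_comps.append(comp)
    return new_weights, new_comps, loglik

# Usage on two made-up sequences (hypothetical data, not from the study).
rng = random.Random(0)
seqs = ["MKTAYIAKQRQISFVKSHFSRQ", "GSGKSTLLKALAGELAPT"]
wins = [w for s in seqs for w in windows(s, 5)]
weights = [0.5, 0.5]
comps = init_components(2, 5, rng)
for _ in range(10):
    weights, comps, loglik = em_step(wins, weights, comps)
```

Because of the pseudocounts, the M-step here is a smoothed estimate rather than a strict maximum-likelihood update. Under these assumptions, one way to compare a real sequence against a randomized version would be to score the windows of each with the trained mixture and compare the resulting log-likelihoods.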