Dept. of Mathematics, Stanford University / Dept. of Biology, MIT
The first level of analysis of a novel genomic sequence typically involves asking two questions: 1) Does the sequence contain any genes? 2) If so, what are the exact locations of the coding exons (and, therefore, what is the sequence of the encoded protein)? In the past several years, a wide variety of computational approaches have been developed in an effort to answer these questions (e.g., GENSCAN, FGENEH, GeneID, GeneMark, GenParser, Genie, GRAIL and PROCRUSTES), and several recent methods have achieved relatively high levels of predictive accuracy. However, simply predicting the most likely gene structure in a genomic sequence is only a partial solution to the general problem of identifying and characterizing novel genes, and a second level of computational and experimental methods is typically required to confirm gene locations and expression patterns. The work described here is aimed at facilitating this second level of genomic sequence analysis by: 1) providing accurate quantitative measures of the quality (reliability) of exons predicted by GENSCAN; and 2) attempting to predict by computational means whether or not a novel gene is alternatively spliced and, if so, the precise locations of the alternative exons and introns. The first goal has been largely achieved by calculating explicit "exon probabilities" (using a modified "forward-backward" procedure) under the probabilistic model of gene sequence/structure employed by GENSCAN. The second goal remains elusive, but evidence will be presented that the presence of alternative splicing is correlated with the presence of high-probability "suboptimal exons", i.e., potential exons that have relatively high probability but are not included in the optimal (highest-probability) gene structure predicted by GENSCAN.
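The idea behind an "exon probability" can be illustrated with a minimal sketch. It is not GENSCAN's model, which is considerably richer; it assumes an ordinary two-state hidden Markov model (noncoding `N`, coding `C`) with made-up parameters. All names (`forward`, `backward`, `posterior_coding`, `segment_probability`) and all numerical values are hypothetical and chosen only for illustration. The forward and backward passes are combined to give the posterior probability that an exact segment is a coding run flanked by noncoding positions, given the whole sequence.

```python
# Toy two-state HMM. Every parameter here is an illustrative assumption,
# not a value taken from GENSCAN.
STATES = ("N", "C")  # N = noncoding, C = coding
START = {"N": 0.9, "C": 0.1}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def forward(seq):
    """alpha[t][s] = P(seq[0..t], state at t is s). Unscaled: short inputs only."""
    alpha = [{s: START[s] * EMIT[s][seq[0]] for s in STATES}]
    for x in seq[1:]:
        prev = alpha[-1]
        alpha.append(
            {s: EMIT[s][x] * sum(prev[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
        )
    return alpha


def backward(seq):
    """beta[t][s] = P(seq[t+1..] | state at t is s)."""
    beta = [{s: 1.0 for s in STATES}]
    for x in reversed(seq[1:]):
        nxt = beta[0]
        beta.insert(
            0,
            {s: sum(TRANS[s][r] * EMIT[r][x] * nxt[r] for r in STATES)
             for s in STATES},
        )
    return beta


def posterior_coding(seq):
    """Per-position posterior probability of the coding state."""
    alpha, beta = forward(seq), backward(seq)
    total = sum(alpha[-1].values())
    return [alpha[t]["C"] * beta[t]["C"] / total for t in range(len(seq))]


def segment_probability(seq, start, end):
    """Posterior probability that seq[start..end] (inclusive, 0-based) is
    exactly one coding run, with noncoding positions on both sides."""
    if not (1 <= start <= end <= len(seq) - 2):
        raise ValueError("segment needs a flanking position on each side")
    alpha, beta = forward(seq), backward(seq)
    total = sum(alpha[-1].values())
    # Enter the coding state at `start` from a noncoding position.
    joint = alpha[start - 1]["N"] * TRANS["N"]["C"] * EMIT["C"][seq[start]]
    # Stay in the coding state through `end`.
    for t in range(start + 1, end + 1):
        joint *= TRANS["C"]["C"] * EMIT["C"][seq[t]]
    # Leave to the noncoding state at `end + 1`, then sum over everything after.
    joint *= TRANS["C"]["N"] * EMIT["N"][seq[end + 1]] * beta[end + 1]["N"]
    return joint / total


if __name__ == "__main__":
    dna = "ATATGCGCGCGCATAT"
    print(posterior_coding(dna))
    print(segment_probability(dna, 4, 11))
```

In this toy setting, a segment can have a substantial posterior probability even when it is absent from the single most likely state path, which is the sense in which a "suboptimal" exon can still be a high-probability one.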