School of Biology, Georgia Institute of Technology
GeneMark, a statistical pattern recognition method for gene prediction from DNA sequence, was used for gene identification in the complete genomes of Haemophilus influenzae, Mycoplasma genitalium, Methanococcus jannaschii, Helicobacter pylori and Escherichia coli. The novel features of GeneMark, namely the use of inhomogeneous Markov models for genes and for gene shadows, were sufficient to identify the locations of 94% of genes, as shown for the E. coli genome. However, some short genes were missed, and 5' gene boundaries were determined precisely in only 68% of cases.
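As an illustration of what an inhomogeneous Markov model of a coding region looks like, the following is a minimal sketch in which a separate transition table is kept for each of the three codon positions. It is not the GeneMark implementation: the first-order model, the pseudocount, and the function names train_periodic_markov and log_likelihood are assumptions made for brevity, and gene shadows are not modelled here.

```python
import math

BASES = "ACGT"  # assumption: input is uppercase and contains only A, C, G, T


def train_periodic_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities per codon position.

    Each training sequence is assumed to start in frame, so the base at
    index i belongs to codon position i % 3. Returns a list of three
    tables, model[phase][prev_base][base] -> probability.
    """
    counts = [
        {prev: {b: pseudocount for b in BASES} for prev in BASES}
        for _ in range(3)
    ]
    for seq in seqs:
        for i in range(1, len(seq)):
            counts[i % 3][seq[i - 1]][seq[i]] += 1

    model = []
    for phase in range(3):
        table = {}
        for prev, row in counts[phase].items():
            total = sum(row.values())
            table[prev] = {b: c / total for b, c in row.items()}
        model.append(table)
    return model


def log_likelihood(seq, model, frame=0):
    """Log-probability of seq (given its first base) under the model,
    assuming the reading frame starts at index `frame`."""
    score = 0.0
    for i in range(1, len(seq)):
        phase = (i - frame) % 3
        score += math.log(model[phase][seq[i - 1]][seq[i]])
    return score


if __name__ == "__main__":
    toy_gene = "GCA" * 20  # hypothetical training data, not from the abstract
    model = train_periodic_markov([toy_gene])
    for frame in range(3):
        print(frame, round(log_likelihood(toy_gene, model, frame), 2))
```

Comparing the scores of the same fragment across the three frames, or against a homogeneous non-coding model, is one simple way such a model can indicate both the presence of a gene and its reading frame.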
To address this problem, the new algorithm, GeneMark.hmm, interprets gene boundaries as transitions between hidden Markov states, and the sliding window technique used in regular GeneMark was abandoned. While the overall performance of GeneMark.hmm is only slightly higher than that of GeneMark (locations of 95% of E. coli genes found), the percentage of precisely predicted genes has increased significantly. One important feature of the GeneMark.hmm design is that the method does not need elaborate HMM training: the Markov models already derived for GeneMark can be used directly in the GeneMark.hmm program.
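The idea of reading gene boundaries as transitions between hidden states can be sketched with a toy two-state model decoded by the Viterbi algorithm. Everything specific below is an assumption for illustration only: the two states "N" (non-coding) and "C" (coding), the single-base emission probabilities, the transition probabilities, and the helper names viterbi and boundaries. The actual GeneMark.hmm state structure and parameters are not given in this abstract.

```python
import math


def viterbi(seq, states, log_start, log_trans, log_emit):
    """Return the most probable state path for seq (standard Viterbi)."""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: log_start[s] + log_emit[s][seq[0]] for s in states}
    backpointers = []
    for ch in seq[1:]:
        prev_best, best, ptr = best, {}, {}
        for s in states:
            src = max(states, key=lambda r: prev_best[r] + log_trans[r][s])
            best[s] = prev_best[src] + log_trans[src][s] + log_emit[s][ch]
            ptr[s] = src
        backpointers.append(ptr)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def boundaries(path):
    """Indices at which the decoded state changes, i.e. predicted boundaries."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


def _log_table(table):
    return {k: {x: math.log(p) for x, p in row.items()} for k, row in table.items()}


# Hypothetical toy parameters (not taken from GeneMark or GeneMark.hmm).
STATES = ("N", "C")
LOG_START = {"N": math.log(0.5), "C": math.log(0.5)}
LOG_TRANS = _log_table({"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}})
LOG_EMIT = _log_table({
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "C": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
})

if __name__ == "__main__":
    dna = "AT" * 5 + "GC" * 5 + "AT" * 5
    decoded = viterbi(dna, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print("".join(decoded))
    print(boundaries(decoded))
```

Because the whole sequence is decoded at once, each boundary is placed at a single position rather than being read off a windowed score, which is the contrast with the sliding window approach drawn in the text.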
Recently, we encountered a difficulty with model derivation. For newly sequenced genomes, such as Methanococcus jannaschii and Helicobacter pylori, experimentally validated training sets were not available. For this case an iterative clustering procedure, called GeneMark-Genesis, was developed. When applied to the whole DNA sequence of a new genome, GeneMark-Genesis determines the parameters of the Typical and Atypical gene models that are used in GeneMark and GeneMark.hmm.
A modification of the GeneMark.hmm algorithm for finding eukaryotic genes has produced promising results. When the set of human genes described by Burset & Guigo was used as a control set, the GeneMark.hmm program was able to determine the exon/intron structure of the gene precisely in 58% of cases.