(1) Laboratoire Associe de l'INRA (France), Department of Genetics, VIB, University of Ghent, B-9000 GENT, Belgium
(2) Center for Biological Sequence Analysis, DTU, DK-2800 LYNGBY, Denmark
(3) Novo Nordisk, DK-2800 BAGSVAERD, Denmark.
Arabidopsis thaliana is the plant model organism for genetics and development, and its genome has been chosen for exhaustive sequencing because of its small size (120 Mb). Transforming these sequence data into biologically relevant information is a major challenge, as it is for every higher eukaryote. Although more than 30,000 Arabidopsis ESTs have been sequenced, at best half of the expected 20,000 genes can be identified from database similarity searches. Because of the specific "style" of the genome, the exon and gene prediction strategies developed for other species have to be tailored to Arabidopsis to be effective.
We developed NetPlantGene for splice site prediction in Arabidopsis. It first combines neural networks that look at a large scale for coding potential and at local scales for donor and acceptor sites. This is an extension of the NetGene method developed for vertebrate genomes. Strikingly, not only the species-specific training set but also the optimal window size for each network differed between vertebrates and Arabidopsis. To refine the predictions, biology-driven rules were incorporated, which significantly increased performance.
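The two-scale input described above can be illustrated with a minimal sketch. It only shows how a local and a large sequence window around each candidate donor (a GT dinucleotide) might be encoded as network input; it is not the NetPlantGene implementation. The function names and both window sizes (15 and 101 bases) are assumptions made for illustration, since the optimal sizes are not given here.

```python
from typing import List, Tuple

ALPHABET = "ACGT"


def one_hot(window: str) -> List[int]:
    """Encode a window as 4 binary inputs per base; unknown bases (e.g. N) become all zeros."""
    vec: List[int] = []
    for base in window.upper():
        vec.extend(1 if base == b else 0 for b in ALPHABET)
    return vec


def window_at(seq: str, center: int, size: int) -> str:
    """Return `size` bases centred on `center`, padded with N beyond the sequence ends."""
    start = center - size // 2
    return "".join(
        seq[i] if 0 <= i < len(seq) else "N" for i in range(start, start + size)
    )


def candidate_donors(
    seq: str,
    local_size: int = 15,    # hypothetical local window (splice site signal)
    global_size: int = 101,  # hypothetical large window (coding potential)
) -> List[Tuple[int, List[int], List[int]]]:
    """Return (position, local input, large-scale input) for every GT dinucleotide."""
    seq = seq.upper()
    out = []
    for pos in range(len(seq) - 1):
        if seq[pos:pos + 2] == "GT":
            local_vec = one_hot(window_at(seq, pos, local_size))
            global_vec = one_hot(window_at(seq, pos, global_size))
            out.append((pos, local_vec, global_vec))
    return out
```

In such a setup each vector would be fed to its own network, and the two outputs combined afterwards; acceptor candidates (AG dinucleotides) would be handled the same way.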
To further improve the quality of acceptor prediction, which remained lower than that of donor prediction, we devised a scheme incorporating a branch-point search. It reduced the false positive rate by a factor of 2, even though consensus sequences in plants are weak. This step first required determining the plant branch-point consensus. There are very few experimental data on plant branch points, which until now have been searched for in a circular way, by assuming a priori that they resemble the metazoan ones. A consensus was constructed by training an HMM on the Arabidopsis intron data set, with a model length of 7 and A fixed at position 6. The acceptor prediction improved with the consensus obtained in this way, whereas the metazoan look-alike search gave no improvement; we take this as evidence that the predicted branch point is the genuine one.
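A branch-point search of this kind can be sketched as a scan for the best-scoring 7-mer with A at position 6 upstream of a candidate acceptor. The sketch below substitutes a simple position weight matrix for the trained HMM, so it is a stand-in and not the published model. The function names, the pseudocount, and the allowed distance range between the branch-point A and the 3' end of the intron (10 to 60 bases) are all assumptions, and any motifs passed to `build_pwm` in an example are toy placeholders rather than the Arabidopsis consensus.

```python
import math
from typing import Dict, List, Optional, Tuple

MOTIF_LEN = 7  # model length stated in the text
BRANCH_A = 5   # 0-based index of position 6, where A is fixed


def build_pwm(motifs: List[str], pseudo: float = 1.0) -> List[Dict[str, float]]:
    """Log-odds matrix against a uniform background, built from aligned 7-mers."""
    pwm = []
    for col in range(MOTIF_LEN):
        counts = {b: pseudo for b in "ACGT"}
        for motif in motifs:
            counts[motif[col]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in "ACGT"})
    return pwm


def best_branch_point(
    intron: str,
    pwm: List[Dict[str, float]],
    min_dist: int = 10,  # hypothetical minimum distance from the A to the intron 3' end
    max_dist: int = 60,  # hypothetical maximum distance
) -> Optional[Tuple[float, int]]:
    """Return (score, index of the branch-point A) for the best 7-mer, or None."""
    intron = intron.upper()
    n = len(intron)
    best = None
    for start in range(n - MOTIF_LEN + 1):
        a_pos = start + BRANCH_A
        dist = n - a_pos
        # Only windows with A at position 6 and within the distance range qualify.
        if intron[a_pos] != "A" or not (min_dist <= dist <= max_dist):
            continue
        # Bases outside ACGT contribute a neutral score of 0.
        score = sum(pwm[i].get(intron[start + i], 0.0) for i in range(MOTIF_LEN))
        if best is None or score > best[0]:
            best = (score, a_pos)
    return best
```

The resulting score could then be combined with the acceptor network output, for instance by rejecting candidate acceptors for which no acceptable branch point is found upstream.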
A last improvement came from analysing a coding potential neural network whose resources were limited to 6 units, three of which were involved in reading frame detection. This observation was used to predict the proper coding phase (the class of splicing according to the codon), which should greatly improve gene structure modeling later on.
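To make the notion of coding phase concrete, the short sketch below computes the three splicing classes from a known gene structure. It assumes the usual convention that phase 0 means the intron falls between two codons, phase 1 after the first base of a codon, and phase 2 after the second base. It only illustrates the definition and says nothing about how the network predicts the phase; the function name is hypothetical.

```python
from typing import List


def intron_phases(coding_exon_lengths: List[int]) -> List[int]:
    """Phase of each intron: number of bases of the current codon read before the intron.

    `coding_exon_lengths` lists the coding part of each exon in 5' to 3' order.
    """
    phases = []
    consumed = 0
    # The last exon is followed by no intron, so it is skipped.
    for length in coding_exon_lengths[:-1]:
        consumed += length
        phases.append(consumed % 3)
    return phases
```

For example, coding exons of 100, 50 and 33 bases give intron phases 1 and 0. A gene model is frame-consistent only if each exon starts in the phase left by the exon before it, which is why a reliable phase prediction constrains how exons can be assembled.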
A major concern in gene prediction, at least when tested on the Arabidopsis genome, is the poor capacity to find the proper borders of genes. Incorrect merging or splitting of exon clusters is often observed with the available modeling packages. We are currently working on methods able to confidently predict the 5' and 3' UTRs and their borders.
Last, every prediction relies on the assumption of a single common model for the predicted object. This is probably a poor approximation for some elements used in gene prediction, and examples of distinct classes are documented, e.g. for coding sequences in E. coli. For a given species (or group of species), better knowledge of the distribution of the gene elements used for gene prediction and modeling should help improve these methods, or at least help anticipate and take into account possible pitfalls.
Hebsgaard et al., 1996, Nucl Acids Res, 24:3439-3452
Tolstrup N, Rouze P & Brunak S, 1997, Nucl Acids Res, 25:3159-3163