South African National Bioinformatics Institute. The University of the Western Cape, Private Bag X17, Cape Town, South Africa. info@sanbi.ac.za, www.sanbi.ac.za
WEB POSTER www.sanbi.ac.za/stack
Our aim is to maximize the generation of accurate, high-quality consensus sequences of human genes, so that we can derive the best possible estimate of the makeup of the human gene set from the data as it becomes available.
The huge size and computational complexity of the problem require that a pre-processed core of information be generated, and that the tools to manipulate it be both powerful and flexible. We have developed a novel set of highly portable tools and used a powerful multiprocessor system to build and process a database of assembled consensus sequences and alignments derived from publicly available human ESTs. The database is an easily distributable core information resource upon which a comprehensive knowledgebase can be built.
The system has been designed to derive the longest current consensus representation of the expressed human genome.
Current EST clustering and processing projects tend to limit "garbage in, garbage out" effects by pruning poor-quality sequence. These strategies are top-down, quality-based "splitter" approaches that build each cluster on strict quality and overlap criteria. Strict criteria sacrifice consensus length for accuracy. Clusters can subsequently be pieced together by further overlap analysis once the base clustering has been performed.
A high-performance clustering method that does not rely on alignment (d2-cluster; Burke, Davison, Hide, in prep.) allows a bottom-up "lumper" approach, in which larger clusters can be built that still contain "noisy" interspersed sequences yet meet strict statistical and empirical criteria for "correct" clustering. Noisy overlapping sequence is not important in deciding membership of a particular cluster.
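The "lumper" idea can be illustrated with a minimal Python sketch. It is not the d2-cluster algorithm: as a stand-in it simply counts the distinct words of length k that two sequences share, and it joins sequences transitively (single linkage) whenever that count reaches a threshold. The function names, the word length k=6 and the threshold min_shared=10 are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def word_counts(seq, k=6):
    """Count every overlapping word of length k in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def lump(seqs, k=6, min_shared=10):
    """Alignment-free, single-linkage clustering (illustrative only).

    Two sequences are linked when they share at least min_shared
    distinct k-words; clusters are the transitive closure of links.
    Returns a sorted list of clusters, each a list of input indices.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    counts = [word_counts(s, k) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        shared = len(counts[i].keys() & counts[j].keys())
        if shared >= min_shared:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Because membership depends only on shared words, a read with a noisy region still joins a cluster as long as enough of its words match another member, and two reads with no direct overlap end up together when a third read bridges them.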
The resulting database system differs markedly from indices such as the TIGR Gene Index (1), and from databases of EST clusters such as UniGene (2), because it does not discard noisy information. Instead, the "dirty" information is carefully checked for useful constituent subsequences. Longer gene consensus sequences can be built that contain both high-quality and poor-quality regions, duly annotated. The database has relational access to annotated alignments via the Genome Sequence Database (3).
Subsequent alignment of the clusters can be performed by a number of algorithms. We chose the simulated annealing approach of TIGR_MSA-contig for accuracy, and then developed a specific consensus builder using a combination of two error analysis systems, DRAW and CONTIGPROC.
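As a rough illustration of building an annotated consensus from an aligned cluster, the Python sketch below takes a simple majority vote per alignment column and writes poorly supported positions in lowercase. It does not reproduce DRAW or CONTIGPROC, whose error analyses are not described here. The function name, the "-" gap character and the 0.75 agreement threshold are assumptions.

```python
from collections import Counter


def consensus(aligned, min_agreement=0.75, gap="-"):
    """Majority-vote consensus over equal-length gapped alignment rows.

    A column's winning base is written in uppercase when its share of
    the non-gap characters is at least min_agreement, and in lowercase
    otherwise, so low-confidence regions are kept but marked.
    Columns covered by no sequence are skipped.
    """
    if len({len(row) for row in aligned}) > 1:
        raise ValueError("alignment rows must have equal length")

    out = []
    for column in zip(*aligned):
        bases = [ch.upper() for ch in column if ch != gap]
        if not bases:
            continue  # no coverage in this column
        base, n = Counter(bases).most_common(1)[0]
        out.append(base if n / len(bases) >= min_agreement else base.lower())
    return "".join(out)
```

Marking weak positions rather than discarding them mirrors the approach described above, in which the consensus keeps both high-quality and poor-quality regions, each annotated.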
The resulting consensus sequences have been collected into an extensively error-annotated Sequence Tag Alignment and Consensus Knowledgebase (STACK) covering all publicly available expressed human gene sequences. Each entry contains all variants of the gene consensus in serial association, separated by spacers. Internal 5' consensus sequences are in random order. However, each entry contains the maximum possible consensus for its gene.
Comparison of gene sequences, aligned clusters and "alternative splice" frequencies from STACK now allows a more comprehensive understanding of the nature of expressed genes. We are in the process of discovering what is artifact and what is genome biology.
References
(1) http://www.tigr.org/tdb/hgi/