ABSTRACTS OF KEYNOTE, PLENARY & CONTRIBUTED LECTURES
KEYNOTE LECTURE ABSTRACTS
Sequence Analysis: Where We Were, Where We Are,
and Where We Might Go
Russell F. Doolittle
Center for Molecular Genetics,University of California, San Diego
The last 35 years have been witness to enormous strides
in sequence analysis. Earlier efforts at numerical taxonomy
not withstanding, it was the availabilty of digitized
information in the way of amino acid sequences coincident with the
development of the digital computer that was so propitious.
The unraveling of the genetic code along with the revelation
that gene duplications were at the heart of organismal
diversity set the stage for molecular explanations of life
on Earth. The early use of digital computers for aligning
characters and determining relationships predated recombinant
DNA studies. In turn, the DNA revolution has provided the
torrent of data that may make a complete undestanding possible.
At this point, the field is brimming with technology and
data both. What should we be trying to learn? The pincipal
goal of the evolutionist remains the reconstruction of
past events that have given rise to the present. Initially,
it was thought that it would be a straightforward matter to
compare genes and genomes and work backwards to a common ancestor.
A major complication arose with the unexpected frequency of
apparent lateral gene transfers, to the point where at this point
there isn't universal agreement on the general nature of the
Tree of Life. A more ambitious challenge is determining
the likely gene content of the earliest cells.
Computers alone will not reveal the truth here.
More data are needed in certain critical areas, especially
from early diverging protists harboring particular bacterial
and archaeal symbionts.
Predicting the Future Course of Human Influenza Virus Evolution
Walter M. Fitch(1), Robin M. Bush(1), Catherine A. Bender(2), Kanta Subbarao(2), and Nancy J. Cox(2)
1. University of California-Irvine, Irvine, CA
2. Centers for Disease Control, Atlanta, GA
We have recently identified 18 codons in the hemagglutinin gene (H3) that are under positive selection to change the encoded amino acid. We believe that this selection is to reduce the effectiveness of the human's immune surveillance by altering the viral surface and using the time bought to get off further rounds of replication before the body cures the disease.
We have also noted previously that the shape of the phylogenetic tree for human influenza is unusual in its having one long trunk with many very short side branches. The trunk obviously represents what may be considered the evolving winners in a competition among the viruses to be the progenitor of future strains of influenza virus. If all that is so, then perhaps a study of replacements in these 18 positions will be indicative of who will win that race before the race is over.
Accordingly, we examined the evolution of the amino acid sequences of HA1 on the evolutionary tree from the genes of 357 strains of human influenza virus isolated from 1983 through 1997. We counted how many replacements in those sites occurred between the root and the tip where each isolate was located. We did this for each flu season using only the sequences through that flu season. The strain with the most replacements in these 18 positions was predicted to indicate where the trunk would emerge in future influenza seasons. This was done for eleven flu seasons. The method correctly predicted the location of the trunk in nine of the eleven trials and in all of the last eight seasons examined.
Introns and Modules in Ancient Conserved Genes
Walter Gilbert
Harvard University, Cambridge, MA
We have studied the correlation between introns and modules, compact regions of protein structure, in genes whose products have a known 3-dimensional structre and which are homologous between bacteria and eukaryotes. Using two definitions of modules, we can show that phase zero introns, those that lie between the codons, are significantly correlated with the boundaries of modules while introns that lie in phases one and two, that interrupt the codons are not so correlated. We will discuss the significance of this finding, which suggests that some of the phase zero introns are residues of the original construction of the genes while the phase one and two introns may have been added later in evolution.
PLENARY LECTURE ABSTRACTS
(Alphabetical listing by presenter)
Progress in ab initio Protein Structure Prediction
David Baker
University of Washington, Seattle, WA
I will first discuss the development of computational models for local
sequence-structure relationships in proteins. We used clustering methods to
identify recurrent sequence patterns that transcend protein family
boundaries, and subsequently an iterative refinement procedure to sharpen
local sequence-structure relationships. The resulting ISITES library of
sequence-structure motifs has proven capable of identifying segments of
proteins which adopt native like structure in isolation. The ISITES motifs
have been developed into a mixture model and a hidden Markov model for
general protein sequence which have shown promise for gene recognition and
local structure prediction. I will then describe ROSETTA, a new method for
ab initio tertiary structure prediction, which is based on the idea that
successful tertiary structure prediction requires modeling of both local and
non local interactions in proteins, but at different levels of detail.
Promising results in the recent CASP3 protein structure prediction
experiment suggest that the low resolution structure prediction problem for
proteins of less than 100 amino acids may be solved in the relatively near
future, and that ab initio structure prediction methods may be able to
contribute to the interpretation/annotation of genomic sequence information.
Exploiting the Past and the Future in Protein Secondary Structure Prediction
Pierre Baldi
University of California, Irvine
Predicting the secondary structure of a protein (alpha-helix,
beta-sheet, coil) is an important step
towards elucidating its three dimensional structure, as well as its
function. Presently, the best predictors are based on machine learning
approaches, in particular neural network architectures with a fixed,
and relatively short, input window of amino acids, centered at the
prediction site. Although a fixed small window
avoids overfitting problems, it does not permit to capture variable
long-ranged information.
We introduce a family of novel
architectures which can learn to make predictions based on variable ranges
of
dependencies. These architectures extend recurrent neural networks,
introducing non-causal bidirectional dynamics to capture both upstream
and downstream information. The prediction algorithm is completed by
the use of mixtures of estimators that leverage evolutionary
information, expressed in terms of multiple alignments, mostly at the
input level. While our system currently
achieves an overall performance exceeding 75% correct
prediction - at least comparable to the best existing systems - the
main emphasis here is on the development of new algorithmic ideas.
Computational Approaches for Structural Genomics
Steven E. Brenner
Department of Structural Biology,Stanford University
Structural genomics aims to provide a good experimental structure or
computational model of every tractable protein in a complete genome.
Underlying this goal is the immense value of protein structure, especially
in permitting recognition of distant evolutionary relationships for
proteins whose sequence analysis has failed to find any significant
homolog. A considerable fraction of the genes in all sequenced genomes
have no known function, and structure determination provides a direct
means of revealing homology which may be used to infer their putative
molecular function. The solved structures will be similarly useful for
elucidating the biochemical or biophysical role of proteins that have been
previously ascribed only phenotypic functions. More generally, knowledge
of an increasingly complete repertoire of protein structures will aid
structure prediction methods, improve understanding of protein structure,
and ultimately lend insight into molecular interactions and pathways.
We use computational methods to select families whose structures cannot be
predicted and which are likely to be amenable to experimental
characterization. Methods to be employed included modern sequence
analysis and clustering algorithms. A critical component is consultation
of the Presage database for structural genomics, which records the
community's experimental work underway and computational predictions. The
protein families are ranked according to several criteria including
taxonomic diversity and known functional information. Individual
proteins, often homologs from hyperthermophiles, are selected from these
families as targets for structure determination. The solved structures
are examined for structural similarity to other proteins of known
structure. Homologous proteins in sequence databases are computationally
modeled, to provide a resource of protein structure models complementing
the experimentally solved protein structures.
The Delphic Boat : What Bacterial Genomes Tell Us?
Antoine Danchin
Institut Pasteur, Paris, France
Contrary to an intuitive idea there is often no predictable link between structure and function in biological objects. Therefore one cannot use genomes alone to discover the meaning of the genome texts. We shall illustrate how the concept of "neighborhood" can be used to create links between a variety of information sources, making the genome text mean something relevant. In particular having analyzed the overall way in which DNA is manipulated in various bacteria we shall see that, despite strong variation, there are strong constraints in the organisation of genes. Observing biases in features which would be thought to be unbiased is the hallmark of some selection pressure. Because the genetic code is redundant, coding sequences can be studied by analyzing their codon preference. The selection pressure maintaining this bias is linked to the organization of the cell's cytoplasm, which is clearly not a tiny test tube. It consists in a ribosome network, moving slowly with respect to local diffusion of the small molecules and macromolecules present in the cell. Ribosomes act as attractors of certain tRNA species, as a function of the local codon usage of the mRNA molecules they translate. This adapts codon usage of the gene corresponding to a given function to the position of its product in the cell. In particular if two genes have a very different codon usage this means that the messager RNAs are not translated at the same place in the cell. The messenger threads are pulled out from DNA by the network of translating ribosomes, going from one ribosome to the next one, as in a wiredrawing machine. Organization of the genes into polycistronic operons results in the fact that proteins having related functions are co-expressed locally, allowing compartmentalization of the corresponding substrates and products. As a consequence, if one goes from a very biased ribosome to a less biased one, the local concentration of the most biased tRNAs decreases. In turn this creates a selection pressure that produces a gradient in codon usage, as one goes further away from the most biased messengers and ribosomes. Investigation of the sequence of known genomes suggests that genes are not randomly distributed along the chromosome. In fact there is an extreme bias in the genes that permit one to differentiate explicitely genes coded by the leading strand from genes coded by the lagging strand. We shall discuss how these features should be implemented in the construction of realistic models of genomes in order to identify new unexpected signals.
Computational Screens for Noncoding RNA Genes in Archaeal Genomes
Sean Eddy
Deptartment of Genetics, Washington University School of Medicine
Some genes produce functional RNAs instead of encoding
proteins. Current genefinding approaches focus almost exclusively on
protein-coding genes, so the diversity of the "modern RNA world" is an
open question. We are developing probabilistic modeling approaches --
specifically, hidden Markov models and stochastic context-free
grammars -- to identify noncoding RNA genes in genome sequence
data. We have been using these methods to study the diversity of small
nucleolar RNAs (snoRNAs), which are responsible for guiding specific
nucleotide modifications of eukaryotic ribosomal RNAs. In
collaboration with Pat Dennis's group at the University of British
Columbia, we have recently found that Archaeal genomes also have
numerous snoRNA genes. Many of our predictions have been confirmed
both experimentally and by comparative analysis among three available
Pyrococcus genome sequences. Because snoRNAs are still unknown in
Bacteria, this result suggests another shared character between the
Archaeal and Eukaryotic lineages.
Comparative Genomics of Three Pyrococcus Species:
Evidences for a Single Replication Origin in these Hyperthermophilic
Archaea and Gene Transfer From Bacteria To Archaea
Patrick Forterre(1), Olivier Pocs(2), Hannu Myllykallio(1), Philippe Lopez(3),
Hervé Philippe(3), Odile Leconte(2), Raymond Ripp(2), Jean-Claude Thierry(2),
Roland Helig(4), Valérie Barbe(4), William Saurin(4), Jean Weissenbach(4), Yvan
Zivanovic(1)
1. Institut de Génétique et Microbiologie, Université de Paris-Sud, Orsay,
France
2. Institut de Génétique et de Biologie Moléculaire et Cellulaire,
Illkirch, CU de Strasbourg , France
3. Laboratoire de Biologie cellulaire,
Université de Paris-Sud, Orsay, France
4. Génoscope, Centre National de
Séquençage, Evry, France
We have completely sequenced the genome of the hyperthermophilic archaeon
Pyrococcus abyssi and performed in silico comparison with the genomes of
Pyrococcus horikoshii and Pyrococcus furiosus. This study emphasizes the
plasticity of genomes at the genus level and reveals extensive transfer of
genes from Bacteria to Pyrococcus. We have detected in silico (1) a single
and similar origin of replication in the three Pyrococcus species and
obtained experimental evidence that validate our prediction in P. abyssi.
The implication of our findings for the evolution of Archaea in general and
the evolution of the replication apparatus in particular will be discussed
in the light of alternative theories about the nature of the Last Universal
Common Ancestor (LUCA) (1,3)
1. LOPEZ, P., PHILIPPE, H, MYLLYKALLIO, H. and FORTERRE, P. Identification
of putative chromosomal origins of replication in archaea. Mol. Microbiol.
32, 883-886 (1999)
2. FORTERRE, P. Displacement of cellular proteins by functional analogues
from plasmids or viruses could explain puzzling phylogenies of many DNA
informational proteins, Mol. Microbiol.33, 457-465 (1999)
3. FORTERRE, P. and PHILIPPE, H Where is the root of the universal tree of
life? Bioessays, in press
Analysis of Microarray Gene Expression Data Using Support Vector Machines
David Haussler
Co-authors: M. P. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet,
T. Furey, and M. Ares, Jr.
Computer Science Department, University of California, Santa Cruz
We have developed a new method of functionally classifying genes
using gene expression data from DNA microarray hybridization
experiments. We train a Support Vector Machine (SVM)
to distinguish expression patterns from genes in one
class of co-regulated genes from expression patterns of genes
in other classes. We have applied the method with some success
to expression data collected for yeast genes by the Pat Brown laboratory.
After training on expression patterns for two thirds of the ribosomal yeast
genes, it identified the remaining known ribosomal genes from among the
approximately 6000 yeast genes with few errors, and identified one
incorrect annotation in the MYGD database. Experiments with other
classes of co-regulated genes also gave useful results, but accuracy depends
on how strongly the class of co-regulated genes exhibits a common
and distinctive expression pattern in the set of DNA microarray hybridization
experiments that is performed.
Block-Based Methods for Detecting Protein Homology
Steven Henikoff, Jorja G. Henikoff and Shmuel Pietrokovski
Fred Hutchinson Cancer Research Center
Whereas sequence databanks continue to expand rapidly, most newly
discovered protein sequences belong to families characterized by conserved
alignment blocks. Methods that utilize blocks rather than single
sequences can efficiently detect weak similarities that reflect functional
constraints. These include: 1) Embedding methods, which extract multiple
alignment information from motif regions while retaining single sequence
information where alignment is uncertain; 2) A graphing method, which is
applied to the results of an all-versus-all search of the Blocks Database
using the LAMA (Local Alignment of Multiple Alignment) searching tool, for
detection of structurally-constrained motifs shared by different protein
families; 3) Design of PCR primers for isolating distantly-related
protein-coding sequences using the COnsensus-DEgenerate Hybrid
Oligonucleotide Primer (CODEHOP) method.
Characterizing Highly Expressed Genes of Diverse Genomes
Samuel Karlin
Department of Mathematics, Stanford University
Intense current biotechnological research centers on macro and micro
(DNA chip) arrays aiming to dissect gene expression under clinical and
environmentally relevant conditions. Our approach in ascertaining gene
expression levels relates to codon usage differences of genes relative to
one or several sets of standard gene classes. Genes that deviate
strongly in codon usage from the average gene but are relatively similar
in codon usage to ribosomal protein genes (RPs) tend to be "highly
expressed." By these criteria, "highly expressed" genes in most
prokaryotic genomes include RPs, transcription and translation processing
factors, chaperonins, and genes of principal energy metabolism. In
particular, for the fast-growing E. coli, H. influenzae, and B. subtilis
species (and S. cerevisiae), major glycolysis genes are highly expressed.
In Synechocystis, genes of photosynthesis are highly expressed and in
methanogens highly expressed genes are those generally essential for
methanogenesis. A gene is referred to as alien if its codon usage
difference from the average gene of the genome exceeds a high threshold
and codon usage difference from RP genes is also high. Such genes with
high codon bias might be acquired through recent horizontal transfer or
might be deviant because of other disrupting influences. Thus, alien
genes in Haemophilus influenzae include genes of the restriction systems
HindIII, HincII, and HgeD generally acquired through plasmid integration.
The identification of alien genes is of great interest with regard to
pathogenicity and, in the context of evolutionary biology with regard to
lateral gene transfer. In this talk we will describe our methods with
applications of characterizing highly expressed genes in diverse
prokaryotic and eukaryotic organisms.
Genetic Headroom
Jeffrey Lawrence
Department of Biological Sciences, University of Pittsburgh
Why do some groups of organisms contain more species than others? What
factors allow for diversification of some lineages, but others? These
questions can be addressed in bacteria using a genomic approach.
Horizontal genetic transfer provides a powerful mechanism by which an
organism can acquire novel metabolic capabilities, exploit novel
environments, and diversify from parental and sibling lineages. I have
described methods for assessing the amount of horizontally transferred DNA
in a genome, and for estimating the date of introgression of each segment;
these methods allow quantitation of the rate of horizontal genetic transfer
in a particular lineage. A question remains as to what limits the
maintenance of horizontally transferred DNA in a genome. Population
genetics (and common sense) tells us that a finite amount of information
can be maintained in a species' genome. I describe here methods of
assessing the apportionment of natural selection in the counterselection of
mutations at a variety of nucleotide sites. This method provides an
estimate of the relative selective "effort" an organism expends in
maintaining different kinds of genomic information. A portion of this
information includes a variety of sites where non-lethal mutations are
being counterselected (like those affecting synonymous codon choice or
codon context). This parameter is termed "genetic headroom," and
represents the genomic potential for maintaining the protein-coding
capacity of additional genes when they introduced by horizontal transfer.
Genomic surveys show that the amount of genetic headroom in a genome is
indeed correlated to the rate of horizontal transfer. A model is presented
where genetic headroom allows an organism to exploit the novel metabolic
functions introduced by horizontal transfer, and may be an accurate
predictor of speciation rates.
Evolution Teaches Structure Prediction
Burkhard Rost
Columbia University, New York
In the wake of the genome data flow, we need - more urgently than ever -
accurate tools to predict protein structure. The problem of predicting
protein structure from sequence remains fundamentally unsolved despite
more than three decades of intensive research effort. However, the
wealth of evolutionary information deposited in current databases
enabled a significant improvement for methods predicting protein
structure in 1D: secondary structure, transmembrane helices, and solvent
accessibility. In particular, the combination of evolutionary
information with neural networks proved extremely successful. The new
generation of prediction methods proved to be accurate and reliable
enough to be useful in genome analysis, and in experimental structure
determination. Moreover, the new generation of theoretical methods is
increasingly influencing experiments in molecular biology, and ready to
digest large amounts of sequence data.
CATH Protein Families: Insights into Structural Evolution and Function
Orengo, C.A, Pearl, F., Lee, D., Bray, J., Todd, A. & Thornton, J.M.
Biomolecular Structure and Modelling Unit, University College, London
The rapid progress of the international genome projects has resulted in a
wealth of sequence information for protein families. There are nearly
500,000 sequences known which can be grouped into ~20,000 families.
Although the structural data has lagged behind the sequences, analysis
suggests that with the advent of structural genomics initiatives we may
soon have structural representatives for many evolutionary protein
families. Since structure can provide crucial information on a protein's
biological role, the next step is to accumulate and analyse functional
data within these families.
At UCL, we have clustered all the well-resolved protein structures, in the
PDB, into structural families. Proteins are first divided into separate
domains and both sequence and structure alignment methods used to identify
relationships. Data on families is stored within the CATH database (Class,
Architecture, Topology or fold and Homologous superfamily). To date, there
are ~18,000 domains within CATH, which cluster into ~1100 homologous
superfamilies and ~650 fold families.
Techniques are now being developed for assigning structural families to
genome sequences. For example, we can identify probable families for >30%
of proteins in Mycoplasma Genitalium (Salamov et al. 1999). Tools are also
being developed for analysing available functional data within each
homologous superfamily. Protein ligand interactions are identified and
displayed (DOMPLOT, Todd et al. 1998) and correlations between sequence
and structure motifs are captured using a new analysis program (CORA,
Orengo, 1999).
Comparitive Protein Structure Modeling in Genomics
Speaker: Andrej Sali
Roberto Sanchez, Francisco Melo, Ursula Pieper, Nebojsa Mirkovic,
Marc Marti-Renom, Andras Fiser, Ashley Stuart and Andrej Sali
The Rockefeller University
Structural genomics aims to determine or accurately predict 3D structure
of most proteins (1). This aim will be achieved by a focused, large-scale
determination of protein structures by X-ray crystallography and NMR
spectroscopy, combined efficiently with accurate protein structure
prediction. Comparative protein structure modeling will be discussed in
this context. To allow large-scale modeling, we automated fold assignment,
sequence-structure alignment, comparative model building, and model
evaluation (2-5). These steps were implemented mostly in our program
Modeller, which is available on the Web at http://guitar.rockefeller.edu/.
The approach has so far been applied to the proteins in 23 complete genomes,
including the C. elegans genome. All-atom 3D models for substantial segments
of 20-45% of the proteins in the 19 genomes have been obtained and stored
in the ModBase database of comparative protein structure models (5,6).
Several examples of how comparative modeling can be useful in the
biological analysis of individual proteins as well as whole genomes will
be described.
1. A. Sali. 100,000 protein structures for the biologist.
Nature Structural Biology 5, 1029-1032. 1998.
2. A. Sali and T.L. Blundell. Comparative protein modelling by
satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815, 1993.
3. R. Sanchez and A. Sali. Large-scale protein structure modeling of
the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA 95,
13597-13602, 1998.
4. R. Sanchez and A. Sali. Comparative protein structure modeling in genomics.
J. Comp. Phys. 150, 1-14, 1999.
5. URL http://guitar.rockefeller.edu:/modbase/.
6. R. Sanchez and A. Sali. ModBase: A database of comparative protein
structure models. Bioinformatics, in press.
Defining Domains in Protein Structure
William Taylor
MRC National Institute of Medical Research, London, UK
A simple method for the definition of protein structural domains is described
that requires only alpha-carbon coordinate data. The basic method, which
encodes no specific aspects of protein structure, captures the essence of most
domains but does not give high enough priority to the integrity of beta-sheet
structure. This aspect was encouraged both by a bias toward attaining intact
sheetss and also as as an acceptance condition on the final result. The method
has only one variable parameter, reflecting the granularity level of the
domains, and an attempt was made to set this level automatically for each
protein based on the best agreement attained between the domains predicted on
the native structure and a set of smoothed coordinates. While not perfect,
this feature allowed some tightly packed domains to be separated that would
have remained undivided had the best fixed granularity level been used. The
quality of the results was high and when compared to a large collection of
accepted domain definitions, only a few could be said to be clearly incorrect.
The simplicity of the method allows its easy extension to the simultaneous
definition of domains across related structures in a way that does not
involve loss of detail trough averaging the structures. This was found to
be a useful approach to reconciling differences among structural family members.
The method is fast, taking about 1 second per 100 residues for smaller proteins.
CONTRIBUTED LECTURE ABSTRACTS
Structural Basis for Triplet Repeat Disorders: A Computational Analysis
Speaker: Soren Brunak
Pierre Baldi, Soren Brunak, Yves Chauvin, and Anders Gorm Pedersen
Technical University of Denmark, Lyngby, Denmark
Over a dozen major degenerative disorders, including myototonic distrophy,
Huntington's disease, and fragile X syndrome, result from unstable
expansions
of particular trinucleotides. Remarkably, only some of all the possible
triplets, namely CAG/CTG, CGG/CCG and GAA/TTC, have been associated with the
known pathological expansions. This raises some basic questions at the DNA
level. Why do particular triplets seem to be singled out? What is the
mechanism
for their expansion and how does it depend on the triplet itself? Could
other
triplets or longer repeats be involved in other diseases?
Using several different computational models of DNA
structure, we show that the triplets involved in the pathological repeats
generally fall into extreme classes. Thus, CAG/CTG repeats are particularly
flexible, whereas GCC, CGG and GAA repeats appear to display both flexible
and
rigid (but curved) characteristics depending on the method of analysis. The
fact that (1) trinucleotide repeats often become increasingly unstable when
they exceed a length of approximately 50 repeats, and (2) repeated 12-mers
display a similar increase in instability above 13 repeats, together suggest
that approximately 150 bp is a general threshold length for repeat
instability.
Since this is about the length of DNA wrapped up in a single nucleosome core
particle, we speculate that chromatin structure may play an important role
in
the expansion mechanism. We furthermore suggest that expansion of a
dodecamer
repeat, which we predict to have very high flexibility, may play a role in
the
pathogenesis of the neurodegenerative disorder multiple system atrophy
(MSA).
Evaluation of gene finding programs using gene contigs and application to A. thaliana
Speaker Pierre Rouzé
Nathalie Pavy, Stephane Rombauts, Patrice Déhais, Catherine Mathé, Davuluri V. V. Ramana, Philippe Leroy and Pierre Rouzé
VIB Department of Plant Genetics & INRA-associated Laboratory, University of Ghent, Belgium
Performing routine annotation of genome sequences is a huge and difficult task. Due to the complexity of gene structure and lack
of homologues for many genes, even the first part of this task, locating the genes along the sequence and more particularly finding
their proper borders, remains a problem for eucaryotic genomes, plants being no exception. To improve the annotation process,
one need to choose and combine the most appropriate tools to use inside a computer-assisted annotation platform. In silico gene
prediction has to deal with DNA sequences with genes occurring on both strands and with intergenic regions, and the ability of
the programs to build multi-gene models, and not only exons, has to be evaluated.
We have developed AraSet, a data set of authentic contigs of validated genes, enabling the evaluation of multi-gene models for
the Arabidopsis genome. Besides conventional metrics to evaluate gene prediction at the site and the exon levels, new measures
were introduced to consider the prediction at the protein sequence level as well as to evaluate gene models. This evaluation
method can apply to any new gene prediction software and to any eukaryotic genome for which a similar dataset can be built.
Using this evaluation for the Arabidopsis genome, we show that if the accuracies of splice site and exon prediction are quite high,
gene modeling accuracy remains very low. From our analysis, the performance of the publicly available prediction programs for the
Arabidopsis genome differs significantly, suggesting GeneMark.hmm as presently the most accurate software at all three levels.
Interestingly, gene modeling could be further improved by specific combination of prediction software.
The AraSet sequence set and validation programs are available at http://sphinx.rug.ac.be:8080/biocomp/napav/.
Genome Rearrangement with Gene Families
David Sankoff
Universite de Montreal, Montreal, Canada
The theory and practice of genome
rearrangement analysis breaks down in the biologically widespread contexts
where each gene may be present in a number of copies, not necessarily
contiguous. In some of these contexts it is, however, appropriate to ask
which members of each gene family in two genomes
are its 'true exemplars', i.e. which best reflect the
original position of the ancestral gene in the common ancestor genome.
This entails a search for the two 'exemplar strings' of same length
having the smallest
possible rearrangement distance: the 'exemplar distance'.
A branch and bound algorithm calculates these
distances efficiently when based on easily calculated traditional
rearrangement distances, such as signed reversals distance or breakpoint
distance, which also satisfy a property of monotonicity in the number of
genes.
A Simple Algorithm for Detecting Circular Permutations in Proteins
Ron Unger
Bar Ilan University, Tel-Aviv, Israel
A circular permutation of a protein is a genetic event
in which part of the C-terminal of the protein is moved to
its N-terminal.
Recently, it has been shown that proteins that undergo
engineered circular permutations generally maintain their
three dimensional structure and biological function.
This observation raises the possibility that circular permutation
has been used by Nature during evolution. In this
scenario a protein underwent circular permutation
into another protein, thereafter
both proteins further diverged by standard genetic operations.
To study this
possibility one needs an efficient algorithm that for a
given pair of proteins can detect the underlying event of
circular permutations.
A naive algorithm
might take time proportional to N**3 or even N**4,
which is prohibitively slow for a large scale survey.
A sophisticated algorithm that runs in
asymptotic time of N**2
was recently suggested, but it is not practical for
a large scale survey.
Here we present a simple and efficient algorithm that
runs in time N**2. The algorithm is based
on duplicating one of the two sequences, and then
performing a modified version of the standard dynamic
programming algorithm. While the algorithm is not
guaranteed to find the optimal results, we present
data that indicate that in practice the algorithm
performs very well. The algorithm was used as a part
of a large scale survey of Swissprot for
circular permutations.
Few interesting examples will be discussed.