Ensembl, Annotation of Large Metazoan Genomes
Ewan Birney
European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SA, UK
The Ensembl project (based at www.ensembl.org) aims to provide an entirely open suite
of data and software for large eukaryotic genomes.
Ensembl provides an actively maintained dataset for the human and mouse genomes.
All the data produced by Ensembl is placed in the public domain; the software is
licensed under an extremely open Apache style license. There are over 20 sites with an
externally running copy of the Ensembl web server and two sites with a full installation
of the web site and underlying analysis system.
Building Ensembl has required facing many bioinformatics and software engineering
challenges, from algorithmic issues in gene prediction, through software engineering
challenges in large-scale compute management, to user interface design. In my talk
I will introduce the Ensembl database and its uses, at the same time touching on some
of the challenges we have met whilst building the system.
Comparative Analysis of Protein Lengths and Amino Acid Usages among the Three Domains of Life
Luciano Brocchieri
Department of Mathematics, Stanford University, Stanford, CA 94305-2125 USA
We analyze variation of protein lengths and amino acid usages in the proteomes from
5 complete eukaryotic genomes, 11 archaeal genomes, and 36 bacterial genomes. Eukaryotic
proteins are longer (median range 346-384 aa) than bacterial proteins (260-295 aa),
while archaeal proteins are the shortest (generally 230-250 aa). The greater length of
eukaryotic proteins probably reflects the intrinsically greater complexity of their structure
and function (multifunctionality, intron-exon structure, alternative splicing). Comparing
quantile distributions of amino acids shows that, among acidic residues, glutamate (Glu) is
used more than aspartate (Asp) in most species of all three domains. A few exceptions
pertain to prokaryotic species of high G+C genomic content. Among these, Halobacterium sp.
exhibits an exceptionally high frequency of Asp, probably as an adaptation to high salt
concentrations. The median usage of hydrophobic residues in eukaryotes is lower than in
virtually all prokaryotes. In particular, eukaryotes have a lower than expected frequency
of isoleucine compared to prokaryotes. Among eukaryotes, human sequences have a lower frequency
of asparagine, a phenomenon that might relate to the peculiar absence of runs of this amino
acid in human sequences.
The frequencies of amino acids encoded by strong bases {Ala, Gly, Pro} are positively
correlated with the genomic G+C content and those encoded by weak bases {Lys, Ile, Phe,
Tyr, Asn, but not Met} are negatively correlated. Amino acid usages are also
compared within specific functional classes (transcriptional classes, DNA replication
and repair, chaperones, etc.), among proteins conserved in all species, and within regions
of low or high conservation. Results are discussed in relation to their functional
interpretation and their implications for phylogenetic studies.
Predicting Splicing Enhancers
Chris Burge
Department of Biology, Massachusetts Institute of Technology, Cambridge, MA
RNA splicing is an essential step in the expression of most eukaryotic genes.
An important goal of research on this process is to determine a set of rules,
perhaps encoded in a computer algorithm, that accurately predicts the splicing
pattern of primary transcripts. I will discuss some of our recent work on
this problem focusing on:
1) modeling the splicing of short introns in five
different organisms (yeast, fly, worm, mustard weed and human);
2) a computational method for predicting which short oligonucleotides function
as exonic splicing enhancers and some preliminary experimental data testing the
function of candidate enhancer motifs.
Isochore Organization of Mammalian Genomes: Selection or Neutral Evolution?
Laurent Duret, Christian Gautier, Dominique Mouchiroud
Laboratoire de Biometrie et Biologie Evolutive, UMR 5558 – CNRS,
Universite Claude Bernard, 43, Bd du 11 Novembre 1918, 69622 Villeurbanne cedex, FRANCE
Pioneering work by Bernardi and colleagues in the 1970s demonstrated that the
base composition is spatially structured in mammalian genomes: chromosomes can be
seen as mosaics of long (>300kb) GC-rich and GC-poor fragments called isochores.
The sequencing of the human genome confirmed the existence of substantial variations
in average GC-content among large fragments (from 33% to 62% G+C), although these
isochores do not appear to be as homogeneous as was expected according to Bernardi's
model. The isochore structure is correlated with various genomic features, including
repeat element distribution, methylation pattern, replication, recombination and,
most remarkably, gene density. However, the biological significance of this large-scale
variation in GC-content remains highly debated. Does the isochore organization result
from selective pressure on base composition, or does it simply reflect a neutral
evolutionary process? We will present recent results on the dating of the origin of
GC-rich isochores in amniotes, on the relationships between isochores and gene expression
patterns, and on the variation of mutation and substitution patterns along chromosomes
(analyses of polymorphism data and substitution in pseudogenes). We will discuss the
selectionist and neutralist models in the light of these new results.
Integrative Genomics beyond the Genes: Computational Analyses of Pseudogenes and Expression Data
M Gerstein, P Harrison, J Qian, V Alexandrov, P Bertone, R Das,
D Greenbaum, R Jansen, W Krebs, N Echols, J Lin, C Wilson, A Drawid,
Z Zhang, Y Kluger, N Lan, N Luscombe
Molecular Biophysics & Biochemistry Department, Yale University, New Haven, CT 06520 USA
I will talk about using the properties and attributes of proteins in two different
types of large-scale genomic analyses. First, I will survey the occurrence of
pseudogenes in several large eukaryotic genomes, focussing on grouping them into
families and functional categories and comparing these groupings with those of
existing "living" genes. Second, I will talk about using protein categories and
features to mine the data from microarray experiments. In particular, I will
present a new method of clustering expression timecourses that finds "time-shifted"
relationships and also a Bayesian method of predicting subcellular localization from
expression data.
References
J Qian, B Stenger, CA Wilson, J Lin, R Jansen, SA Teichmann, J Park, WG Krebs,
H Yu, V Alexandrov, N Echols, M Gerstein (2001). "PartsList: a web-based
system for dynamically ranking protein folds based on disparate attributes,
including whole-genome expression and interaction information."
Nucleic Acids Res 29: 1750-64
PM Harrison, N Echols, M Gerstein (2001). "Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome." Nucleic Acids Res 29: 818-30
A Drawid, M Gerstein (2000). "A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome." J Mol Biol 301: 1059-75
R Jansen, M Gerstein (2000). "Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins." Nucleic Acids Res 28: 1481-8
Remote Homology Detection and Protein Classification
Nick V. Grishin
Howard Hughes Medical Institute/Dept. of Biochemistry, Rm. L4.247A,
University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd.,
Dallas, TX, 75390-9050 USA
Approaches integrating sequence, structure and functional information with evolutionary
considerations have proven most effective for understanding weak similarities
between proteins. Several examples of remote homology detection using a combination of
computational methods will be discussed. In particular, the power of transitive sequence
similarity searches for reliable detection of homologs at close to and below random
sequence identity will be illustrated. Several pairs of proteins with statistically
supported sequence similarity that adopt different structural folds will be shown.
Testing Hypotheses of Genome Duplication
Austin L. Hughes
Department of Biological Sciences, University of South Carolina, Columbia SC 29208 USA
Though widely cited, the hypothesis of ancient genome duplication, particularly in
the case of the vertebrates, has only recently been tested rigorously. The availability
of complete or nearly complete genomic sequences for a number of eukaryotic species will
greatly facilitate rigorous testing of genome duplication hypotheses. We review evidence
from a number of recent tests of these hypotheses, particularly of the hypothesis that
the vertebrates underwent two rounds of genome duplication by polyploidization early in
their history (the 2R hypothesis). Tests of the 2R hypothesis include the following:
(1) comparison of gene numbers in homologous families between vertebrate and invertebrate genomes;
(2) phylogenetic tests of the hypothesis that genes in apparently duplicated genomic blocks
actually duplicated simultaneously, as expected to occur by polyploidization;
(3) phylogenetic tests of the hypothesis that gene duplications in 4-member vertebrate families
occurred early in vertebrate history, as predicted by the 2R hypothesis;
(4) phylogenetic tests of the hypothesis that four-member families have the topology of
the form predicted under the 2R hypothesis; i.e., a topology of the form (AB) (CD)
or two clusters of two. Application of these approaches to the complete human genome
provides no support for the 2R hypothesis. On the contrary, these analyses suggest that
genomes are structured in ways that are so far largely unexplored.
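Test (4) above can be illustrated with a small sketch. It assumes rooted trees are written as nested Python tuples and that a four-member family has leaves labelled A to D; the function names are hypothetical and are not taken from the author's analysis. The enumeration shows why the balanced shape is informative: only 3 of the 15 possible rooted topologies on four genes have the (AB)(CD) form.

```python
def is_two_by_two(tree) -> bool:
    """True if a rooted four-leaf tree has the balanced (AB)(CD) shape."""
    if not isinstance(tree, tuple) or len(tree) != 2:
        return False
    # Both children of the root must themselves be pairs of plain leaves.
    return all(
        isinstance(side, tuple)
        and len(side) == 2
        and not any(isinstance(leaf, tuple) for leaf in side)
        for side in tree
    )


def all_rooted_topologies(leaves):
    """Enumerate every rooted binary topology on four labelled leaves."""
    first_leaf = leaves[0]
    trees = []
    # Balanced trees: pair the first leaf with each possible partner.
    for partner in leaves[1:]:
        rest = tuple(x for x in leaves[1:] if x != partner)
        trees.append(((first_leaf, partner), rest))
    # Ladder trees: outermost leaf, next leaf, then the innermost pair.
    for outer in leaves:
        for middle in leaves:
            if middle == outer:
                continue
            inner = tuple(x for x in leaves if x not in (outer, middle))
            trees.append((outer, (middle, inner)))
    return trees


topologies = all_rooted_topologies(["A", "B", "C", "D"])
balanced_count = sum(is_two_by_two(t) for t in topologies)
print(f"{balanced_count} of {len(topologies)} topologies are (AB)(CD)")
```

In a real analysis the observed fraction of balanced trees among well-supported four-member families would be compared against this chance expectation.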
A Direct Estimate of Human per Nucleotide Spontaneous Mutation Rates at 20 Loci
Causing Mendelian Diseases
Alexey S. Kondrashov
National Center for Biotechnology Information, NIH, 45 Center Drive, MSC 6510, Bethesda, MD 20892, USA
I estimate per nucleotide rates of spontaneous mutations of different kinds in humans from
the data on per locus mutation rates and on sequences of de novo loss-of-function
alleles at 8 loci causing autosomal dominant and 12 loci causing X-linked diseases.
I use only those mutations that surely inactivate the locus, i.e., nonsense nucleotide
substitutions and frameshifts. For most of the loci, estimates of the combined rate of
all mutations are between 1x10^-8 and 3x10^-8. Coefficient of variation of per nucleotide
mutation rates at different loci is much smaller than that of per locus rates, and there is
only a slight tendency for loci with high per locus rates to also have high per nucleotide rates.
Substitutions are much more common than length difference mutations, and deletions are ~4 times
more common than insertions. Mutation hot spots with per nucleotide rates above 10^-6 make
only a minor contribution to the overall human mutation rate. There is a close agreement between
direct estimates of per nucleotide mutation rates and their indirect estimates, obtained by
comparison of human and chimpanzee pseudogenes. Thus, a human zygote carries at least ~100
de novo mutations, and perhaps over 10 of them are deleterious.
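The step from a per nucleotide rate to a per zygote count is simple arithmetic, sketched below. The rates are the range quoted in the abstract; the haploid genome size of about 3.2 billion base pairs is an assumed round figure that the abstract does not state.

```python
# Back-of-the-envelope check of the per-zygote mutation count.
# Rates come from the abstract; the genome size is an assumption.
HAPLOID_GENOME_BP = 3.2e9                   # assumed approximate human haploid genome size
DIPLOID_GENOME_BP = 2 * HAPLOID_GENOME_BP   # a zygote carries two haploid copies


def de_novo_mutations_per_zygote(rate_per_nucleotide: float) -> float:
    """Expected number of new mutations in one diploid zygote."""
    return rate_per_nucleotide * DIPLOID_GENOME_BP


low = de_novo_mutations_per_zygote(1e-8)
high = de_novo_mutations_per_zygote(3e-8)
print(f"{low:.0f} to {high:.0f} de novo mutations per zygote")
```

Under these assumptions the range brackets the figure of roughly 100 new mutations per zygote given in the abstract.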
Back to RNA World through Comparative Genomics
Eugene V. Koonin
National Center for Biotechnology Information, National Library of Medicine,
National Institutes of Health, Bethesda MD 20894, USA
With multiple, complete genome sequences now available for representatives of all
three primary kingdoms, bacteria, archaea and eukaryotes, a reliable reconstruction
of the protein repertoire of the Last Universal Common Ancestor (LUCA) becomes possible.
Examination of this ancestral protein set reveals many groups of paralogs, which provides
for the reconstruction of even earlier stages of evolution, leading back to the
ribozyme-dominated RNA world, from which the first proteins emerged. The results of
evolutionary reconstructions for central components of the translation machinery, such as
aminoacyl-tRNA synthetases and translation factors, indicate that a substantial diversity
of protein domains had evolved even before the modern-type translation system, based on
protein catalysts, was established. The reconstruction of pre-LUCA steps of evolution
indicates that the first proteins included a small set of low-specificity, multifunctional,
RNA-binding and nucleotide-binding domains, which facilitated RNA replication and translation.
Origin and Evolution of Meiosis: Complex Machinery of Sex
John M. Logsdon, Jr.
Department of Biology, Emory University, Atlanta GA 30322;
jlogsdon@biology.emory.edu
The origin and evolution of sexual reproduction in eukaryotes is a major unsolved
puzzle for biology. In particular, questions about the initial establishment of
meiosis, the central process by which sexual reproduction proceeds, are largely
unanswered. Molecular phylogenetic studies have been initiated for meiotic genes
obtained from a wide range of species, focusing mainly on protozoa which represent
most of eukaryotic phylogenetic diversity. When in the history of eukaryotes did
meiosis arise? The answer to this question will provide a phylogenetic framework
for further studies to understand the origin, evolution and function of meiotic
sex. To initiate this work, we are studying genes for the eukaryotic homologs
of the bacterial recombinase, recA. Two major eukaryotic recA paralogs, RAD51
and DMC1, have been isolated from a diversity of protists. In addition to the
RAD51/DMC1 gene duplication serving as a probable marker for the origin of meiosis,
the presence of DMC1 in species diverging after the duplication may itself indicate
sexuality (with its absence suggesting asexuality). Our results indicate that the
RAD51/DMC1 gene duplication occurred early in eukaryotic evolution: prior to the
divergence of those protist lineages from which recA genes were obtained. If the
RAD51/DMC1 gene duplication is coincident with the origin of meiosis, this indicates
that either meiosis is a process ancestral to all extant eukaryotes or that we have
not yet sampled key protist species representing early-diverging lineage(s).
Giardia lamblia is a putatively deep-branching, possibly asexual species and
perhaps the best candidate for a primitively ameiotic group. Surprisingly, it encodes
two paralogs of DMC1 but it is also the only species in which RAD51 has not been found.
The presence of DMC1 genes indicates that G. lamblia may be cryptically sexual.
In fact, starting from the partially-sequenced G. lamblia genome and using a combination
of bioinformatic and directed isolation methods, we have sequenced a number of
additional meiotic genes from G. lamblia. These and other results have allowed us
to begin describing a conserved "core" meiotic machinery in eukaryotes. Progress on
the isolation and analysis of additional meiotic genes using bioinformatic and
directed efforts will be presented along with some considerations on the origin
of the meiotic machinery itself.
From Database Information to Prediction of Protein-DNA and Protein-Protein Interaction
Hanah Margalit
Department of Molecular Genetics and Biotechnology, Faculty of Medicine,
The Hebrew University of Jerusalem, P.O.B. 12272, Ein Kerem Jerusalem 91120 ISRAEL
The data accumulating in biological databases pose a challenge: to extract biological
insight from this information and to use that knowledge in prediction. Here we demonstrate
how we have addressed this challenge for two major problems in molecular biology,
protein-DNA recognition and protein-protein interaction. For the protein-DNA
recognition problem we extracted information from crystallographically solved protein-DNA
complexes and from databases of transcription factors and their binding sites.
We demonstrate how these types of information have allowed us to derive quantitative parameters
for amino acid-base interaction, which can be used in turn for prediction of transcription factor
binding sites in gene-upstream regions. In regard to the protein-protein interaction problem we
have demonstrated that characteristic pairs of sequence-signatures can be learned from a database
of experimentally determined interacting proteins, where one protein contains one
sequence-signature and its interacting partner contains the other sequence-signature.
It is proposed that these correlated sequence-signatures can be used as markers for predicting
putative pairs of interacting proteins in the cell.
The Natural History of Domains
Chris Ponting
MRC Functional Genetics Unit, University of Oxford, Department of
Human Anatomy and Genetics, South Parks Road, Oxford OX1 3QX, UK
Domains have been the most persistent units of protein structure throughout
evolution. Fusions with other domains, to form repertoires of domain architectures,
and other mutational events have contributed greatly to functional innovation. The
wealth of sequence and structure data available from diverse species now allows us
to trace the propagation of domains from ancient times to the present day. Such studies
show that the majority of domain families are demonstrably ancient and that sequence
divergence has masked the long-standing heritage of many of the remaining families.
Modern sequences even hint at the protein structures of the pre-domain world.
The abundance of short repeat-containing domains and, more rarely, inserted motifs
argues in favour of the evolution of modern single polypeptide domains from ancient
short peptide ancestors. These findings argue that there is a need for domain families
to be classified within a hierarchy similar to Linnaeus' Systema Naturae, the classification
of species.
Genome Archeology Leading to the Characterization and Classification of Transport Proteins
Milton H. Saier, Jr.
Division of Biology, University of California at San Diego, 9500 Gilman Drive,
La Jolla, CA, 92093-0116, USA
In the study of transmembrane transport, molecular phylogeny provides a reliable guide to
protein structure, catalytic and noncatalytic transport mechanisms, mode of energy coupling
and substrate specificity. It also allows prediction of the evolutionary history of a
transporter family, leading to estimations of its age, source, and route of appearance.
Phylogenetic analyses, therefore, provide a rational basis for the characterization and
classification of transporters. A universal classification system has been described,
based on both function and phylogeny, which has been designed to be applicable to all currently
recognized and yet-to-be discovered transport proteins found in living organisms on Earth.
Probabilistic Codes for DNA-Protein Interactions
Gary D. Stormo
Washington University Medical School, St. Louis, MO
The search for a "recognition code" that would allow prediction of high
affinity DNA-protein interactions has continued for over two decades. The
original hope for a simple, deterministic code was undone by the first few
DNA-protein complex structures that were solved crystallographically. But
clear preferences for specific combinations of interacting base pairs and
amino acids have led to qualitative rules that are used to explain, and
sometimes to predict, preferred protein-binding site combinations. At the
same time efforts to develop more quantitative relationships have emerged
and shown some success. This talk will describe our approach to determine
a probabilistic code for the interaction of EGR family zinc finger
proteins with DNA binding sites. It will describe the model for interaction
that we employ and the similarities and differences with previous models. It will
also describe the method we use to obtain the maximum likelihood estimates
for the parameters of the model and the status of the current results.
The 2R Hypothesis and the Human Genome Sequence
Kenneth H. Wolfe, Aoife McLysaght, and Karsten Hokamp
Department of Genetics, Smurfit Institute, University of Dublin, Trinity College, Dublin 2, IRELAND
We are investigating whether the draft sequence of the human genome provides support for the 2R hypothesis of two rounds of genome duplication, first proposed by Ohno. Our dataset was release 1.0 of Ensembl (April 2001), which contains 27,615 predicted proteins. After removal of alternative splice variants, highly similar tandem repeat genes, and unmapped genes, we were left with 20,830 proteins encoded by genes that appear on the UCSC Golden Path (Dec. 2000). All-against-all BLASTP searches were carried out on these proteins using a 20-processor Linux cluster. Dot-matrix plots of the results show that the human genome does not contain large duplicated regions on the scale of those found in Saccharomyces cerevisiae or Arabidopsis thaliana. However, many duplicated chromosomal regions can be identified; their number and extent depend greatly on the parameters used to define them. To focus on gene duplications that occurred within the chordate lineage, as envisaged by the 2R hypothesis, we used Drosophila and Caenorhabditis sequences as a heuristic orthology threshold and only searched for duplicated blocks composed of human paralogs that are more similar to each other (by BLASTP E-value) than to their closest invertebrate homologs. We also required a maximum BLASTP expectation value of E ≤ 1e-7, and a maximum gap size of 30 unduplicated genes between any two paralogs making up a block. Using these parameters we find that the human genome contains many more paralogous regions than expected by chance. Ninety-six pairs of large duplicated regions, each containing at least 6 duplicated genes, cover 44% of the genome. These apparently duplicated chromosomal regions in human are statistically significant, as judged by comparisons to computer simulations where gene locations were randomized.
In an independent search, all gene families in human (not just those in duplicated chromosomal regions) were identified for which Drosophila and Caenorhabditis outgroup sequences were known. Phylogenetic trees drawn from these families, followed by molecular clock estimation of the human gene duplication date, showed a small excess of gene duplications with ages roughly 333–583 Mya (0.4–0.7 × the age of the split between human and Drosophila). This may indicate some sort of increased duplication activity at that time.
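The paralog filter described in the first paragraph of this abstract (an E-value cutoff of 1e-7, plus the requirement that two human proteins be more similar to each other than to their closest invertebrate homologs) might be sketched as follows. The record layout, the function name, and the treatment of proteins with no invertebrate hit are assumptions made for illustration; the gap-size rule used to assemble blocks is not covered here.

```python
from typing import NamedTuple

MAX_EVALUE = 1e-7  # cutoff quoted in the abstract


class Hit(NamedTuple):
    """One human-versus-human BLASTP hit (hypothetical record layout)."""
    gene_a: str
    gene_b: str
    evalue: float


def keep_pair(pair: Hit, best_invertebrate_evalue: dict) -> bool:
    """Apply both filters to one candidate pair of human paralogs."""
    if pair.evalue > MAX_EVALUE:
        return False
    # Assumption: a protein with no invertebrate hit has an infinitely weak one.
    outgroup_a = best_invertebrate_evalue.get(pair.gene_a, float("inf"))
    outgroup_b = best_invertebrate_evalue.get(pair.gene_b, float("inf"))
    # A lower E-value means a stronger similarity.
    return pair.evalue < outgroup_a and pair.evalue < outgroup_b


# Toy data, invented for the example.
invertebrate_best = {"HOXA1": 1e-30, "HOXB1": 1e-28, "GENE_X": 1e-90}
candidates = [
    Hit("HOXA1", "HOXB1", 1e-60),   # closer to each other than to the outgroup
    Hit("GENE_X", "HOXA1", 1e-40),  # GENE_X is closer to its invertebrate homolog
    Hit("HOXB1", "GENE_Y", 1e-3),   # fails the E-value cutoff
]
kept = [hit for hit in candidates if keep_pair(hit, invertebrate_best)]
print([(hit.gene_a, hit.gene_b) for hit in kept])
```

Only pairs that pass both tests would go on to the block-building step.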
Can Bioinformatics Tackle Signal Transduction?
Igor B. Zhulin
School of Biology, Georgia Institute of Technology, 310 Ferst Drive, Atlanta, GA 30332-0230, USA
One of the major goals of comparative genomics is to predict a biological function for proteins. A variety of bioinformatics tools have been developed and successfully applied to achieve this goal. For many enzymes, not only the mode of action, but also the exact substrate specificity can be predicted with confidence. However, function prediction for some classes of proteins is difficult due to their mosaic structure and the presence of highly variable domains. Proteins comprising signal transduction pathways are probably the best example of this problem. Current annotation of signal transduction proteins in both prokaryotes and eukaryotes is limited to the identification of a superfamily based on the presence of one highly conserved domain, for example a histidine kinase, a response regulator, a bHLH transcription factor, etc. An approach is presented in which accurate predictions of biological function are achieved by combining bioinformatics tools, such as sensitive similarity searches (PSI-BLAST), conservation patterns of multiple alignments, and phylogenetics, with current biological knowledge on the structure and function of individual proteins and domains. Examples of refined predictions of biological function will be given for several superfamilies of signal transduction proteins, including histidine kinases, response regulators, chemotaxis transducers and guanylyl cyclases/phosphodiesterases.
Enrichment of Regulatory Signals in Conserved Non-Coding Genomic Sequence
Samuel Levy, Sridhar Hannenhalli and Christopher Workman
Celera Genomics, 45 West Gude Drive, Rockville, MD 20850 USA
Motivation: Whole genome shotgun sequencing strategies generate sequence data prior to the application of assembly methodologies that result in contiguous sequence. Sequence reads can be employed to indicate regions of conservation between closely related species for which only one genome has been assembled. Consequently, by using pairwise sequence alignment methods it is possible to identify novel, non-repetitive, conserved segments in non-coding sequence shared between the assembled human genome and mouse whole genome shotgun sequencing fragments. Conserved non-coding regions identify potentially functional DNA that could be involved in transcriptional regulation.
Results: Local sequence alignment methods were applied employing mouse fragments and the assembled human genome. In addition, transcription factor binding sites were detected by aligning their corresponding positional weight matrices to the sequence regions. These methods were applied to a set of transcripts corresponding to 502 genes associated with a variety of different human diseases taken from the Online Mendelian Inheritance in Man database. Using statistical arguments we have shown that conserved non-coding segments contain an enrichment of transcription factor binding sites when compared to the sequence background in which the conserved segments are located. This enrichment of binding sites was not observed in coding sequence. Conserved non-coding segments are not extensively repeated in the genome and therefore their identification provides a rapid means of finding genes with related conserved regions, and consequently potentially related regulatory mechanisms. Conserved segments in upstream regions are found to contain binding sites that are co-localized in a manner consistent with experimentally known transcription factor pairwise co-occurrences and afford the identification of novel co-occurring TF pairs. This study provides a methodology and more evidence to suggest that conserved non-coding regions are biologically significant since they contain a statistical enrichment of regulatory signals and pairs of signals that enable the construction of regulatory models for human genes.
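Aligning a positional weight matrix to a sequence region, as mentioned in the Results, can be sketched generically as a sliding-window log-odds scan. The uniform background, the pseudocount, the threshold and the toy motif below are all assumptions chosen for illustration; the abstract does not give the authors' matrices or scoring parameters.

```python
import math


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (offset, score) for every window scoring at or above the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy motif "TATA" observed in ten hypothetical binding sites.
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
pwm = log_odds_matrix(toy_counts)
hits = scan("GGTATAGG", pwm, threshold=5.0)
print(hits)
```

Enrichment could then be assessed by comparing hit densities inside conserved segments with those in the surrounding background sequence.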
Birth of Scale-Free Molecular Networks and the Number of Distinct DNA and Protein Domains Per Genome
Andrey Rzhetsky (1,2), and Shawn M. Gomez (1)
(1)Columbia Genome Center and (2)Department of Medical Informatics, Columbia University, New York 10032, USA
Motivation: Current growth in the field of genomics has provided a number of exciting approaches to the modeling of evolutionary mechanisms within the genome. Separately, dynamical and statistical analyses of networks such as the World Wide Web and the social interactions existing between humans have shown that these networks can exhibit common fractal properties – including the property of being scale-free. This work attempts to bridge these two fields and demonstrate that the fractal properties of molecular networks are linked to the fractal properties of their underlying genomes.
Results: We suggest a stochastic model capable of describing the evolutionary growth of metabolic or signal-transduction networks. This model generates networks that share important statistical properties (so-called scale-free behavior) with real molecular networks. In particular, the frequency of vertices connected to exactly k other vertices follows a power-law distribution. The shape of this distribution remains invariant to changes in network scale: A small subgraph has the same distribution as the complete graph from which it is derived. Furthermore, the model correctly predicts that the frequencies of distinct DNA and protein domains also follow a power-law distribution. Finally, the model leads to a simple equation linking the total number of different DNA and protein domains in a genome with both the total number of genes and the overall network topology.
Availability: MatLab (MathWorks, Inc.) programs described in this manuscript are available on request from the authors.
Contact: ar345@columbia.edu
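The abstract does not specify the authors' stochastic model, so the sketch below substitutes a generic preferential-attachment growth process, a standard way to obtain a scale-free degree distribution. It is meant only to illustrate what "the frequency of vertices connected to exactly k other vertices" means in practice; the function names and the fixed random seed are assumptions.

```python
import random
from collections import Counter


def grow_network(n_nodes, seed=0):
    """Grow a network one vertex at a time by preferential attachment.

    Each new vertex links to one existing vertex chosen with probability
    proportional to that vertex's current degree.
    """
    if n_nodes < 2:
        raise ValueError("need at least two vertices")
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every vertex appears in this list once per incident edge, so a uniform
    # draw from it is a degree-proportional draw over vertices.
    endpoints = [0, 1]
    for new_vertex in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new_vertex, target))
        endpoints.extend((new_vertex, target))
    return edges


def degree_histogram(edges):
    """Map each degree k to the number of vertices with exactly k neighbours."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


edges = grow_network(2000)
histogram = degree_histogram(edges)
for k in sorted(histogram)[:5]:
    print(k, histogram[k])
```

Plotting the histogram on log-log axes would show the roughly straight line that characterizes a power law.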
Clustering Protein Sequences - Structure Prediction by Transitive Homology
Eva Bolten, Alexander Schliep, Sebastian Schneckener, Dietmar Schomburg, and Rainer Schrader
ZPR/ZAIK, University of Cologne, Weyertal 80, 50931 Cologne, GERMANY
It is widely believed that for two proteins A and B a sequence identity above some threshold implies structural similarity due to a common evolutionary ancestor. Since this is only a sufficient, but not a necessary condition for structural similarity, the question remains what other criteria can be used to identify remote homologues. Transitivity refers to the concept of deducing a structural similarity between proteins A and C from the existence of a third protein B, such that A and B as well as B and C are homologues, as established when the sequence identity between A and B as well as that between B and C is above the aforementioned threshold. It is not fully understood whether transitivity always holds and whether it can be extended ad infinitum. We developed a graph-based clustering approach in which transitivity plays a crucial role. We determined all pair-wise similarities for the sequences in the SwissProt database using the Smith-Waterman local alignment algorithm. These data were transformed into a weighted directed graph, where protein sequences constitute vertices and weights correspond to alignment scores effectively scaled by sequence length. This asymmetric distance and the subsequent clustering based on strongly connected components seem to be robust with respect to problems caused by increased noise levels in larger databases or multidomain proteins. The method was evaluated on two releases from SCOP and showed a drastic improvement over pair-wise comparisons in terms of detecting remote homologues. We also discuss a very favorable comparison with PSI-Blast.
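A minimal sketch of clustering by strongly connected components is given below. The scaling rule (alignment score divided by the length of the source sequence), the threshold and the toy scores are assumptions; the abstract only says that scores are "effectively scaled by sequence length". The example shows how an asymmetric weight can keep a long multidomain protein M from merging the unrelated families that each match one of its domains.

```python
def build_graph(scores, lengths, threshold):
    """Add a directed edge a -> b when score(a, b) / len(a) reaches the threshold."""
    graph = {name: set() for name in lengths}
    for (a, b), score in scores.items():
        if a == b:
            continue
        # One symmetric alignment score gives two differently scaled weights.
        if score / lengths[a] >= threshold:
            graph[a].add(b)
        if score / lengths[b] >= threshold:
            graph[b].add(a)
    return graph


def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each vertex to a set of successors."""
    # First pass: record vertices in order of DFS completion (iterative DFS).
    order, seen = [], set()
    for root in sorted(graph):
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(sorted(graph[root])))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(sorted(graph[nxt]))))
                    break
            else:
                order.append(node)
                stack.pop()
    # Second pass: explore the reversed graph in decreasing completion order.
    reverse = {vertex: set() for vertex in graph}
    for source, targets in graph.items():
        for target in targets:
            reverse[target].add(source)
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


# Toy data: M is a long multidomain protein that matches both A and D.
lengths = {"A": 100, "B": 100, "C": 110, "D": 100, "M": 400}
scores = {("A", "B"): 80, ("B", "C"): 85, ("A", "M"): 80, ("D", "M"): 80}
clusters = strongly_connected_components(build_graph(scores, lengths, threshold=0.5))
print(sorted(sorted(cluster) for cluster in clusters))
```

With symmetric weights M would connect A and D into a single cluster; here A, B and C cluster together while D and M remain separate.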
Detection of cis-element clusters in higher eukaryotic DNA
M.C. Frith(1), U. Hansen(2), and Z. Weng(3)
(1) Bioinformatics Program, Boston University, 44 Cummington St.,
Boston, MA 02215, USA;
(2)Department of Biology, Boston University, 5 Cummington St., Boston MA 02215, USA;
(3)Department of Biomedical Engineering, Boston University, 44 Cummington St.,
Boston, MA 02215, USA
Motivation: Computational prediction and analysis of transcription regulatory regions in DNA sequences has the potential to accelerate greatly our understanding of how cellular processes are controlled. We present a hidden Markov model based method for detecting regulatory regions in DNA sequences, by searching for clusters of cis-elements.
Results: When applied to regulatory targets of the transcription factor LSF, this method achieves a sensitivity of 67%, while making one prediction per 33 kb of non-repetitive human genomic sequence. When applied to muscle specific regulatory regions, we obtain a sensitivity and prediction rate that compare favorably with one of the best alternative approaches. Our method, which we call Cister, can be used to predict different varieties of regulatory region by searching for clusters of cis-elements of any type chosen by the user. Cister is simple to use and is available on the web.
Availability: http://sullivan.bu.edu/~mfrith/cister.shtml
Contact: mfrith@bu.edu; zhiping@bu.edu
Model-Based Clustering and Data Transformations for Gene Expression Data
Ka Yee Yeung, Walter L. Ruzzo
University of Washington, Department of Computer Science, Box 352350, Seattle, WA 98195 USA
Clustering is a useful exploratory technique for the analysis of gene
expression data. Many different heuristic clustering algorithms have
been proposed in this context. Clustering algorithms based on probability
models offer a principled alternative to heuristic algorithms. In particular,
model-based clustering assumes that the data is generated by a finite mixture
of underlying probability distributions, such as multivariate normal
distributions. The issues of selecting a "good" clustering method and
determining the "correct" number of clusters are reduced to model selection
problems in the probability framework.
We benchmarked the performance of model-based clustering on several synthetic
and real gene expression data sets for which external evaluation criteria were
available. The model-based approach has superior performance on our synthetic
data sets, consistently selecting the correct model and the number of clusters.
On real expression data, the model-based approach produced clusters of quality
comparable to a leading heuristic clustering algorithm, but with the key
advantage of suggesting the number of clusters and an appropriate model.
Prediction of disulfide connectivity in proteins
Piero Fariselli and
Rita Casadio
CIRB Biocomputing Unit, Laboratory of Biophysics, Department of Biology,
University of Bologna, via Irnerio 42, 40126 Bologna, ITALY
Motivation: A major problem in protein structure prediction is the correct location of disulfide bridges in cysteine-rich proteins. In protein-folding prediction, the location of disulfide bridges can strongly reduce the search in the conformational space. Therefore the correct prediction of the disulfide connectivity starting from the protein residue sequence may also help in predicting its 3D structure.
Results: In this paper we equate the problem of predicting the disulfide connectivity in proteins to a problem of finding the graph matching with the maximum weight. The graph vertices are the cysteine residues that form disulfide bridges, and the edge weights are contact potentials. In order to solve this problem we develop and test different residue contact potentials. The best performing one, based on the Edmonds–Gabow algorithm and Monte-Carlo simulated annealing, reaches an accuracy significantly higher than that obtained with a general mean force contact potential. Significantly, in the case of proteins with four disulfide bonds in the structure, the accuracy is 17 times higher than that of a random predictor. The method presented here can be used to locate putative disulfide bridges in protein folding.
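To make the matching formulation concrete, here is a minimal sketch that enumerates every perfect matching of a small set of cysteines and keeps the one with the largest total weight. It is an exhaustive search, not the Edmonds–Gabow algorithm named in the abstract, and the cysteine positions and weights are invented for illustration; they are not contact potentials from the paper.

```python
def perfect_matchings(items):
    """Yield every way to pair up an even-sized list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail

def best_connectivity(cysteines, weight):
    """Return the pairing with the largest summed edge weight."""
    return max(perfect_matchings(cysteines),
               key=lambda m: sum(weight[frozenset(pair)] for pair in m))

# Hypothetical cysteine positions and pair weights (illustrative only).
cysteines = [3, 10, 22, 31]
weight = {
    frozenset((3, 10)): 0.5, frozenset((3, 22)): 0.9, frozenset((3, 31)): 0.1,
    frozenset((10, 22)): 0.2, frozenset((10, 31)): 0.8, frozenset((22, 31)): 0.4,
}
best = best_connectivity(cysteines, weight)
print(best)

# Number of possible connectivity patterns for four bonds (eight cysteines).
patterns_four_bonds = sum(1 for _ in perfect_matchings(list(range(8))))
print(patterns_four_bonds)
```

The count for eight cysteines is 7 x 5 x 3 x 1 = 105, so a predictor that picks a pattern uniformly at random is right with probability 1/105. Exhaustive enumeration grows quickly with the number of bonds, which is one reason a polynomial-time matching algorithm is attractive.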
Availability: The program is available upon request from the authors.
Contact:
Casadio@alma.unibo.it;
Piero@biocomp.unibo.it
DIANA-EST: a statistical analysis
Artemis G. Hatzigeorgiou(1,2), Petko Fiziev(1) and Martin Reczko(2)
(1)Metagen GmbH, Ihnestr.63, 14195 Berlin, Germany and
(2) Synaptic Ltd, Science and Technology Park of Crete, PO Box 1447, Voutes Heraklion, 71110 Greece
Motivation: Expressed Sequence Tags (ESTs) are, next to cDNA sequences, the most
direct way to locate the genes of the genome in silico and determine their structure.
Currently ESTs make up more than 60% of all the database entries. The goal of this
work is the development of a new program called DNA Intelligent Analysis for ESTs
(DIANA-EST) based on a combination of Artificial Neural Networks (ANN) and statistics
for the characterization of the coding regions within ESTs and the reconstruction of
the encoded protein.
Results: 89.7% of the nucleotides from an independent test set of 127 ESTs were
correctly predicted as coding or non-coding.
Availability: The program is available upon request from the author.
Contact: Present address: Department of Genetics, University of Pennsylvania,
School of Medicine, 475 Clinical Research Building, 415 Curie Boulevard, Philadelphia,
PA 19104-6145, USA.
artemis@pcbi.upenn.edu.
A Biosystems Network Ontology Based on Petri Nets
John Ambrosiano(1) and Joseph S. Oliveira(2)
(1)Los Alamos National Laboratory and (2)Pacific Northwest National Laboratory, USA
Complex biological systems on many levels, from genetic regulatory networks to communities of cells and organisms, can be viewed conceptually as self-regulating control networks. Unfortunately in biology, the diversity of interpretations that must be applied to this simple concept is enormous. This introduces substantial practical difficulties in designing ontologies for biological networks because we want them to be general enough to accommodate a broad range of interpretations, and yet still support data structures that can be customized to specific bioinformatics applications. While there are many good efforts underway to define knowledge ontologies for systems biology [1], we believe that the key to eventual success, that is a truly generic conceptual framework for biosystems networks, remains a challenge.
We will describe a conceptual framework under development for biosystems ontologies based on Petri nets. In the past, Petri nets have been applied successfully in the analysis of complex networks occurring in a number of settings such as parallel computing and manufacturing-distribution systems. Recently, Petri net models have also been applied to biomolecular networks [2,3].
The formal Petri net model has a number of features that appear ideal for capturing fundamental relationships in control systems; and models of biosystems ranging from reaction kinetics to logic circuits seem to map onto them well. Earlier work in "event nets," as these systems were once called, suggests that category theory may provide the formal basis on which to build useful mappings for ontology interchange. This can in turn provide a solid foundation for generic, object-oriented implementations of bioinformatics software that would be capable of handling the complex and diverse data sets expected to emerge from rapidly expanding research in systems biology.
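As a small illustration of how a Petri net can encode a biochemical process, the sketch below implements the basic firing rule: a transition is enabled when each input place holds enough tokens, and firing it moves tokens from inputs to outputs. The class, the place names and the enzyme example are hypothetical and are not taken from the framework described in this abstract.

```python
class PetriNet:
    """Minimal place/transition net with integer token counts."""

    def __init__(self, marking):
        self.marking = dict(marking)      # place -> number of tokens
        self.transitions = {}             # name -> (inputs, outputs)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (dict(inputs), dict(outputs))

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) >= n for p, n in inputs.items())

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place, n in inputs.items():   # consume input tokens
            self.marking[place] -= n
        for place, n in outputs.items():  # produce output tokens
            self.marking[place] = self.marking.get(place, 0) + n

# Hypothetical enzyme reaction: E + S -> ES -> E + P
net = PetriNet({"E": 1, "S": 2, "ES": 0, "P": 0})
net.add_transition("bind", {"E": 1, "S": 1}, {"ES": 1})
net.add_transition("catalyze", {"ES": 1}, {"E": 1, "P": 1})

for step in ["bind", "catalyze", "bind", "catalyze"]:
    net.fire(step)
print(net.marking)
```

After both substrate tokens are converted, the "bind" transition is no longer enabled, which is how the net expresses that the reaction has run out of substrate.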
References
1. M. Hucka, A. Finney, H. Sauro, H. Bolouri, "Introduction to the
Systems Biology Workbench," California Institute of Technology (2001).
See:
www.cds.caltech.edu/erato/the_project.html
2. Joseph S. Oliveira, Colin G. Bailey, Janet B. Jones-Oliveira, and David Dixon,
"An algebraic-combinatorial model for the identification and mapping of
biochemical pathways," to appear in Bull. Math. Biol.
3. Peter J.E. Goss, Jean Peccoud, "Quantitative modeling of stochastic systems in molecular biology by using stochastic Petri nets," PNAS, 95, 6750, June 1998.
Prediction of Structure, Function and Evolution of a Putative
β-xylosidase in Escherichia coli using Bioinformatic Techniques
Anuradha R, L. O. Ingram and J. F. Preston
Department of Microbiology and Cell Science, IFAS, University of Florida, Gainesville FL 32611
The main objective of this work is to explore, through bioinformatics, the
potential of the putative E. coli gene (yagH) to express a functional enzyme,
β-xylosidase. Statistically based sequence similarity methods are often used to
predict protein function. The yagH gene shares 52% homology with a 56 kDa
functional β-xylosidase (xynB) present in B. pumilus. There are at least 10 other
xylosidases/arabinofuranosidases (known and putative) that share homologous
domains with yagH and xynB, also belonging to Class 43 of glycosyl hydrolases.
The xynB (β-xylosidase/α-arabinosidase) from Butyrivibrio fibrisolvens, belonging
to GH43, has been shown to cleave the glycosidic bond with an inversion of
anomeric conformation. Glycosidases in this class have similar catalytic residues
and hydrolysis occurs with inversion at the anomeric carbon. Based on the
Pfam classification of proteins, these enzymes can be divided into three
domains. The region containing amino acids 127 to 309 includes the catalytic
domain of known enzymes in this family. The evolutionary and functional
implications of the domain architectures were analysed using phylogenetic
bootstrapped NJ trees. The gene yagH is flanked on one side by yagG (a
putative permease) and on the other side by yagI (a putative transcriptional
regulator). Computational predictions strongly suggest that the transcription
unit yagG_yagH comprises an operon. Codon usage and factorial correspondence
analysis of E. coli genes show that the yagH gene belongs to the class III
cluster, which in turn strongly indicates inheritance by horizontal gene
transfer, probably from a Bacillus species. Predicted values of free energy
of folding, isoelectric point and linear charge density, based upon primary
structure, are similar for B. pumilus and E. coli, implying similar structure
and function. Most functional restraints on evolutionary divergence operate
at the level of tertiary structure and hence three-dimensional structures are
more conserved in evolution than are sequences. In the absence of solved
structures, the ROSETTA method (which accounts for both local and non-local
interactions) was used to generate tertiary structures for short peptides of
the catalytic domains with the lowest free energy minimum. The structures
generated have similar secondary structure and folding patterns, allowing
similar catalytic activity. Thus all data generated through computational
predictions indicate that yagH encodes a functional β-xylosidase.
Reconstructing ORFs for the EST and mRNA Assemblies in the AllGenes Gene Index Project
Vladimir Babenko, Brian Brunk, Jonathan Crabtree, Li Li, Christian Stoeckert
Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA, USA
AllGenes (www.allgenes.org) is a gene index
project created at the University of Pennsylvania based on EST and mRNA sequences.
Currently the AllGenes project contains up to a million assembled nucleic acid
sequences (assemblies) for mouse and human. These assembly sequences were generated
from the set of EST and mRNA in GenBank (as of August, 2001) using the CAP4 program
(Paracel). To provide Open Reading Frames (ORFs) for AllGenes assemblies, four
different programs for ORF reconstruction were compared to select the one with the best
combination of run time and accuracy, and the framefinder program
(www.hgmp.mrc.ac.uk/~gslater/estateman/framefinder.html)
was chosen for further use. We have developed a statistical model for assessing
the a posteriori significance of ORF reconstruction based on the nucleic acid and
ORF lengths. This gives us the ability to identify poor ORFs and thus reduce the
noise from pseudo-coding regions as well as assess the significance of ORF length.
The statistic is based on the Bernoulli extreme value model with Poisson approximation
similar to the p-value statistic implemented in BLAST.
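The abstract does not give the formula, so the following is only a sketch of one textbook statistic of this general kind, not the authors' model. It assumes independent, uniformly distributed codons (3 of 64 are stops), treats each stop-free run of at least a given length as a rare event, and uses the Poisson approximation P = 1 - exp(-lambda) for the chance of seeing at least one such run. The function name, the six-frame default and all parameter values are assumptions.

```python
import math

STOP_PROB = 3 / 64  # assumption: independent, uniformly distributed codons

def orf_p_value(seq_len_nt, orf_len_codons, frames=6, stop_prob=STOP_PROB):
    """Chance of at least one stop-free run this long in random sequence.

    The expected number of runs that begin just after a stop codon and
    extend for at least orf_len_codons codons is approximated by
        lambda = N * q * (1 - q) ** L
    with N codon positions over all frames, q the stop probability and
    L the ORF length. The p-value is then 1 - exp(-lambda).
    """
    codons = frames * (seq_len_nt // 3)
    expected_runs = codons * stop_prob * (1 - stop_prob) ** orf_len_codons
    return -math.expm1(-expected_runs)

# Illustrative comparison for a 600-nucleotide assembly.
p_short = orf_p_value(600, 100)
p_long = orf_p_value(600, 150)
print(p_short, p_long)
```

Under these assumptions the p-value depends on both the nucleic acid length and the ORF length, and longer ORFs in the same sequence are less likely to arise by chance, which is the behaviour the abstract describes.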
We ran framefinder on 363520 mouse assemblies consisting of 71709 non-singletons and 291811 singletons (assemblies with one input sequence). It took 56 hours to reconstruct and submit mouse ORFs to the GUS database underlying AllGenes on a Dell Dual Pentium III 450 MHz. We identified significant (p<0.05) ORFs in 50934 cases. Approximately half of these ORFs start with methionine (corresponding to the start ATG codon). To evaluate the quality of the ORFs obtained and validate the framefinder results, we performed blastx and blastp similarity searches against nrdb (ncbi.nlm.nih.gov) for nucleic acid sequences (assemblies) and ORFs, respectively. For that we restricted our attention to the ORFs with p-values less than 0.05. We found that in 98% of cases the blastp subjects for ORFs were consistent with the blastx subjects for this subclass. This high degree of consistency provides an internal check of the validity of the translations. Cases where ORFs had no homology to known proteins (11% of the total set with p<0.05) are therefore not likely to be artifacts. We did identify at least some cases in which trivial translation is more efficient than using framefinder. A range of illustrative examples is presented underlining the features, caveats and advantages of framefinder application.
Gene Finding Applications of New Models of RNA-mRNA Interactions
John Besemer, Alex Lomsadze and Mark Borodovsky
School of Biology, Georgia Institute of Technology, Atlanta, Georgia, 30332-0230 USA
Binding of functional RNAs to mRNA is an important feature of numerous cellular processes including translation initiation, intron splicing and polyadenylation. Typically, multiple sequence alignment algorithms, such as Gibbs sampling and simulated annealing, or maximum likelihood approaches, such as hidden Markov models, are used to elucidate the motifs that play a role in these RNA-RNA interactions. The result of these methods is typically a position specific frequency matrix of nucleotide composition. While models based on these methods have been useful for improving gene predictions, they are not adequate for understanding the in vivo mechanisms. Here we propose a new approach to building models for these sites, focusing on the binding between the 16S rRNA and mRNA. These new models were used to improve gene finding accuracy in the most recent versions of GeneMark.hmm and GeneMarkS.
Robust Cluster Analysis of DNA Microarray Data:
An Application of Nonparametric Correlation Dissimilarity
David R. Bickel
Medical College of Georgia, Office of Biostatistics and Bioinformatics
1120 Fifteenth St., AE-3037, Augusta, GA 30912-4900 USA
Several methods have been proposed for using expression data to classify genes into biologically meaningful groups. Although some of these techniques do not explicitly specify their assumptions about the data, the success of each method depends on how well its underlying model describes the patterns of expression. Outlier-resistant and distribution-free clustering of genes can be performed with nonparametric measures of the (dis)similarity of expression values such as intensity ratios or average differences; e.g., a simple robust measure of the similarity between genes is Spearman's correlation, R, computed by ranking the values across microarrays. A dissimilarity metric is then defined as the Euclidean distance D=sqrt{1-[(R+1)/2]^C} or D=sqrt{1-[abs(R)]^C}, C>0. (A distance between the vectors of ranks can also quantify dissimilarity.) Given a (dis)similarity measure, genes can be clustered by optimizing the sum of (dis)similarities of each gene from the closest of k central genes. Each cluster is then described by the range of data for each microarray for a fixed proportion of genes in that cluster that are closest to its central gene; error bars can similarly be computed for other ways of clustering around k central objects. These methods are applied to the data of DeRisi et al. (1997), with an evaluation of the performance of D relative to the analogous distance based on Pearson's correlation of the logs of expression ratios. Such methods are generally applicable to other types of data.
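The two dissimilarity formulas above can be sketched directly. The code below computes Spearman's R by ranking each gene's values across arrays (ties receive the mean of their ranks) and then applies D = sqrt(1 - [(R+1)/2]^C) or D = sqrt(1 - |R|^C). The expression values are invented, and C = 1 is an arbitrary choice for the example.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's R: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def dissimilarity(r, c=1.0, signed=True):
    """D = sqrt(1 - [(R+1)/2]^C) if signed, else D = sqrt(1 - |R|^C)."""
    base = (r + 1) / 2 if signed else abs(r)
    return math.sqrt(max(0.0, 1 - base ** c))

# Invented expression profiles; gene_a contains an extreme outlier.
gene_a = [1.0, 2.0, 4.0, 8.0, 100.0]
gene_b = [0.1, 0.2, 0.3, 0.4, 0.5]
gene_c = [0.5, 0.4, 0.3, 0.2, 0.1]

r_ab = spearman(gene_a, gene_b)
r_ac = spearman(gene_a, gene_c)
print(dissimilarity(r_ab), dissimilarity(r_ac), dissimilarity(r_ac, signed=False))
```

Because only ranks enter the calculation, the outlier in gene_a has no more influence than any other value. The signed form treats anti-correlated genes as maximally dissimilar, while the absolute-value form places them at distance zero, so the choice between the two depends on whether opposite expression patterns should share a cluster.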
A Distributed Protein Visualization Application
Tolga Can
Department of Computer Science University of California, Santa Barbara, CA, USA
Protein visualization has become increasingly popular, especially since the completion of the Human Genome Project. Although several visualization packages are available to scientists, few address collaboration, e.g. simultaneous access to the same protein model. Most current systems are standalone applications, and researchers have to share their ideas by exchanging snapshots of the protein models.
We have developed a distributed protein visualization application, in which a protein molecule can be viewed synchronously by many users in different geographical locations. Our system provides the 3D representations found in many of today's protein visualization systems. These representations include: backbone model, balls-sticks model, space-fill model, and ribbon model. The structure information of protein molecules is obtained in the form of a Protein Data Bank (PDB) file. The 3D models are built as Java3D scene graphs using the atomic coordinate information contained in the PDB file. Users can interact with the 3D models using zoom, pan and rotation functions. Furthermore we provide textual information in a "molecule information window" and a "tree view window". The former includes information such as molecule name, number of amino acids in the molecule, the amino acid chain as one-letter symbols, and the currently selected amino acid. The latter describes the hierarchy of the protein molecule in terms of both primary and secondary structure. We implemented two-way interaction between the hierarchical representation and the 3D models in the following sense. Users can select a sub-structure, e.g. an amino acid or an atom, in the molecule using either the tree view or the 3D view, and the corresponding structure is highlighted in the other view.
A session server handles the communication between the users. Users share the same view of a 3D protein model by using a locking mechanism. Our implementation is based on Java. It allows users on different platforms to connect to the same collaboration session.
We plan to add new 3D representations, such as electron density map and solid surface model, to our visualization system. We are also considering incorporating a protein folding algorithm, which will enable users not only to visualize proteins of unknown structure, but also to model and create new proteins on the fly by changing the amino acid sequence.
Genome-wide Comparative Analysis of Transcriptional Regulatory Regions
Yu Chen(1), Victor Olman(2), Ying Xu(1,2), Dong Xu(1,2)
(1)University of Tennessee-ORNL Graduate School of Genome Science and Technology,
Knoxville, TN 37996 USA and (2)Oak Ridge National Laboratory, Oak Ridge, TN, 37831 USA
Knowledge of the transcriptional regulatory network is an indispensable prerequisite for understanding cellular function. However, the evolution of regulatory regions is not well understood compared to the evolution of coding regions. Sequence comparison between the regulatory regions of orthologs and paralogs may provide some insight into the evolution of the regulatory regions. For this purpose, we used the 51 archaeal and bacterial genomes with gene annotations from NCBI. From the COG database, we selected several orthologs that appear in many genomes, such as orthologs of flavohemoprotein. Genes with significant sequence similarity to each other in the same genome are presumed to be paralogs. As expected, gene regulatory regions are less conserved than gene coding regions among orthologs or among paralogs. However, the correlation between the sequence identity in the coding regions and the sequence identity in their regulatory regions is stronger among orthologs than among paralogs. We also carried out a comparative promoter analysis of the genes among the orthologs and paralogs using AlignACE. We found that orthologs have more conserved patterns in their promoter regions than paralogs. The patterns of regulatory regions provide quantitative measurements for the divergence of gene functions among paralogs and the convergence of gene functions among orthologs. This confirms that pattern changes in regulatory regions play an important role in genome evolution. We also analyzed the occurrence of dimeric tandem repeats, which are remarkably abundant in eukaryotic DNA. We found that gene regulatory regions have a stronger long-range correlation of dimeric tandem repeats than coding regions. This suggests that mutations in the coding regions may be more independent of each other (that is, less correlated) than mutations in the regulatory regions during evolution.
Parallelism between Fusion Peptides and Other Fusion Systems Revealed by an Exhaustive Search for Sequences with Potential for Dynamic Insertion into Membranes
Victoria Dominguez Del Angel, Jean-Paul Mornon and Isabelle Callebaut
Laboratoire de Mineralogie-Cristallographie, CNRS UMR C7590, Universites Pierre et Marie Curie (P6) et Denis Diderot (P7), Case 115, 4, place Jussieu F-75252 Paris, FRANCE
Main aspects of protein function analysis include the detection of functional homologs potentially omitted by simple sequence alignment methods. This study is based on the fusion peptides: special short fragments (up to 20 amino acids) from viral fusion proteins whose sequences are highly conserved within one virus family, but not among different families. Fusion peptides are involved in membrane fusion processes. They facilitate membrane fusion by inserting deeply into the lipid bilayer of the target membrane, destabilizing lipids and thus leading to hemifusion. We used a well-characterized, non-redundant set of fusion proteins to carry out a statistical analysis and identify amino acids that are preferentially found in fusion systems. A preference was found for Alanine and Threonine, involved in mobility through the lipids and alpha helical structures, and for Isoleucine and Methionine, for hydrophobicity. In light of these results, we designed software for screening the protein database (Swiss-Prot) and searching for putative functional homologs that could play critical roles in membrane fusion among other virus families or fusion systems.
Comparing Protein Clustering Methods Using the Arabidopsis Proteome
Christine G. Elsik and William R. Pearson
Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA
Protein classification based on pairwise sequence comparison has been used
in comparative genomics, representative protein selection, gene prediction
assessment, and function and structure assignment. We wish to assemble a
comprehensive, but non-redundant, set of protein clusters with homogenous
domain organization within clusters. The simplest method of clustering by
transitive sequence similarity, single linkage, frequently generates clusters
that are much too large, because multidomain proteins have pulled unrelated
proteins into the same cluster. One way to avoid this is to cluster domains
instead of complete proteins. However, some applications benefit from
classification of complete proteins. Consequently, alternative algorithms
for clustering complete proteins have been developed to prevent incorrect
merging of multidomain proteins.
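The chaining problem described above can be reproduced with a few lines of single-linkage clustering, implemented here with a union-find structure. The protein names and similarity pairs are hypothetical; "fusion" stands for a two-domain protein that is similar to members of two otherwise unrelated families.

```python
def single_linkage(proteins, similar_pairs):
    """Group proteins connected by any chain of pairwise similarities."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical example: two single-domain families and one two-domain protein.
proteins = ["kinase1", "kinase2", "sh2a", "sh2b", "fusion"]
within_family = [("kinase1", "kinase2"), ("sh2a", "sh2b")]
via_fusion = [("kinase1", "fusion"), ("fusion", "sh2a")]

without_fusion_hits = single_linkage(proteins, within_family)
with_fusion_hits = single_linkage(proteins, within_family + via_fusion)
print(without_fusion_hits)
print(with_fusion_hits)
```

With only the within-family pairs, the two families stay separate. Adding the two hits to the multidomain protein merges all five proteins into one cluster, even though the kinase-like and SH2-like proteins share no domain with each other.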
We used the Arabidopsis proteome (25,549 proteins) to evaluate the ability
of complete protein clustering methods
to minimize total cluster number while maintaining reasonable cluster sizes
and consistent domain organizations. A pairwise similarity measure, E()-value,
was the linkage criterion for four linkage-based methods: single linkage
(similar to the SEALS grouper program and GEANFAMMER's clustering algorithm),
average linkage (similar to ProtoMap), complete linkage and fractional linkage
(similar to grouper options and algorithms used by Celera). We also evaluated
single linkage based on pairwise identity and alignment coverage (similar to
BLASTCLUST). The methods were assessed using a cluster domain consistency
(CDC) score after comparing proteins with Pfam, Smart and ProDom to identify
domains. Each method was tested with a range of linkage thresholds, resulting
in 7,702 to 25,053 clusters, with 4,940 to 24,744 singletons, respectively.
At the most relaxed threshold (E() < 10^-10), single linkage based on
E()-value alone generated the smallest cluster number, with the largest
cluster containing 5852 proteins, which is unlikely to be a biologically
meaningful grouping. At the same threshold, average linkage produced 9607
clusters; the largest cluster included 395 sequences. Although cluster sets
produced by E()-based single linkage had the poorest CDC scores, sets of >
16,000 clusters generated by identity/alignment-based single linkage
(> 40% identity) had the best CDC scores of all methods. Average and
fractional linkage performed better than identity/alignment-based single
linkage for sets with less than 16,000 clusters. Using a conservative
threshold of E() < 10^-10, we found at least 9600 unique domain organizations
in the Arabidopsis proteome.
New Features of FPC (FingerPrinted Contigs) V6.0
F. Engler(1), J. Hatfield(1), S. Blundy(1), S. Ness(2), C. Soderlund(1)
(1)Clemson University Genomics Institute, Clemson University
Clemson, SC, 29634, USA and (2)Genome Sequence Centre, British
Columbia Cancer Agency, 600 West 10th Avenue, Vancouver, BC V5Z 4E6, CANADA
Already used worldwide for the physical mapping of nontrivial genomes such as human, rice, and mouse, FPC (FingerPrinted Contigs) continues to grow. Recent improvements include a port to the Gimp Toolkit graphics library, which provides better graphics and higher compatibility than the previously used Athena widgets. Also, recent additions to the code will exploit parallelism when run on multiprocessor clusters. Three new features have been added that will make FPC even more useful. First, BSS provides the ability to run BLAST searches from FPC and use the results to map sequence back to the physical map. This tool has many applications, including that of picking a minimal tiling path and providing information needed to merge contigs. Second, FPC Simulated Digest takes sequence files as input and digests them in silico, outputting band files needed to add clones to FPC. With this tool, new sequence can be downloaded from global databases such as GenBank, converted to band files, and added as clones to FPC.
Finally, WebFPC provides a Java display for FPC, allowing any user to view the vital information from FPC online, as well as linking the user to databases that contain additional information on selected clones.
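The in silico digest step can be pictured with a short sketch: find every occurrence of a restriction enzyme recognition site in a sequence, cut at a fixed offset within the site, and report the fragment lengths. This is only an illustration of the idea. It does not reproduce FPC's band file format or the Simulated Digest tool itself, it handles linear sequences on one strand only, and the example sequence is invented.

```python
def digest(sequence, site, cut_offset):
    """Cut a linear sequence at each occurrence of site; return fragment lengths.

    cut_offset is the number of bases of the site that stay on the
    left-hand fragment (1 for HindIII, which cuts A^AGCTT).
    """
    sequence, site = sequence.upper(), site.upper()
    cuts, start = [], 0
    while True:
        hit = sequence.find(site, start)
        if hit == -1:
            break
        cuts.append(hit + cut_offset)
        start = hit + 1
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:]) if right > left]

# Invented 24-base sequence containing two HindIII sites (AAGCTT).
clone = "CCCC" + "AAGCTT" + "GGGGGG" + "AAGCTT" + "TT"
fragments = digest(clone, "AAGCTT", cut_offset=1)
print(fragments)
```

The list of fragment lengths plays the role of a simulated fingerprint for the clone. A real pipeline would also need to account for the size range and resolution of the gel system being imitated, which this sketch ignores.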
In Silico Prediction of the Transcriptional Regulation of Human Genes
M.C. Frith(1), J. Spouge(2), U. Hansen(3), and
Z. Weng(4)
(1) Bioinformatics Program, Boston University, 44 Cummington St.,
Boston, MA 02215, USA;
(2)National Center for Biotechnology Information, National Library of Medicine,
Bethesda, MD 20894, USA;
(3)Department of Biology, Boston University, 5 Cummington St., Boston MA 02215, USA;
(4)Department of Biomedical Engineering, Boston University, 44 Cummington St.,
Boston, MA 02215, USA
The control of gene expression by modulation of transcription rate is one
of the most fundamental processes in human physiology and disease. Transcription
is regulated by the binding of transcription factors to cis-elements in the DNA
sequence, which are often, but not always, located close to the 5' end of the gene.
Since the entire human genome sequence is (almost) available, it ought to be
possible to detect these transcription factor binding signals for any known
human gene. Although traditional methods for cis-element detection can accurately
predict affinities of transcription factors for naked DNA in vitro, they utterly
fail at predicting functional binding sites in vivo. Two ideas for improving this
pessimistic situation are presented in this poster. The first approach is to
detect statistically significant clusters of multiple cis-elements, motivated
by the observation that higher eukaryotic genes are typically regulated by
multiple transcription factors. The second method is to search for conserved
binding sites in alignments of orthologous DNA from human and another species
such as mouse, assuming that cis-elements tend to be conserved across evolution.
Sequencing and Comparison of Orthopoxviruses
Michael Frace, Melissa Olsen-Rasmussen, Roger Morey, Yu Li, Miriam Laker,
Richard Kline, Scott Sammons, Inger Damon, Robert Wohlhueter, Joseph J. Esposito,
Ming Zhang
Centers for Disease Control and Prevention, 1600 Clifton Rd. NE,
MailStop G-36, Atlanta, GA 30333 USA
The genomes of six variola major strains were sequenced by using long-distance,
high-fidelity PCR of overlapping amplicons as templates for fluorescence-based
sequencing. Each genome is approximately 186 kb of double-stranded DNA with
between 190 and 250 predicted open reading frames (ORFs) of greater than 60
amino acids. The most highly conserved ORFs are located in the center portion
of the genome, and the majority have known functions involving transcription,
DNA replication and repair, protein processing, virion structure, and
nucleotide metabolism.
To minimize the amount of poxvirus needed, 15 µg of purified genomic DNA
was used as template for approximately 1800 primer-walking sequencing
reactions. The reactions were set up using robotic assistance and subjected
to thermocycling, and the reaction products were separated by capillary
electrophoresis (Beckman Coulter CEQ 2000XL). Sequencing of variola strains
Congo-1970, Somalia-1977, India-1964, Horn-1948, Nepal-1973, and
Afghanistan-1970 has been completed.
Output sequence trace files were edited, evaluated for quality, and then
assembled by using Phred/Phrap/Consed software until about a 10-fold
redundancy of high-quality sequence data was attained. The ORFs were
predicted using Glimmer, GeneMark, and getorf. Each ORF sequence has
been compared with the five other locally sequenced strains and with
sequences of previously published orthopoxviruses Bangladesh-1975
(L22579), India-1967 (X69198), and vaccinia virus Copenhagen (M35027).
These predicted ORFs were also analyzed for the presence of known early,
middle, and late promoter sequences. All results and analyses will be
integrated into a relational database customized for orthopox viral genomes.
Identification of Sequence and Structural Determinants of Functional Diversification Using Site Specific Amino Acid Variation Profiles
Daniel S. Gonzalez(1), G. Reid Bishop(2) and I. King Jordan(3)
(1)United States Department of Agriculture, Aquatic Animal Health
Research Unit, Auburn, Alabama 36831, USA;
(2)Department of Chemistry, Millsaps College, Jackson, Mississippi 39210, USA;
(3)National Center for Biotechnology Information, National Library of
Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
The extraction of position specific variability information from alignments of homologous proteins is emerging as a powerful method for gleaning meaningful biological information from sequence data. This approach was employed here to identify sequence and structural elements that have likely been shaped by natural selection during the functional diversification of organelle-specific groups of class I processing α-mannosidases. Class I α-mannosidases make up a homologous and functionally diverse family of glycoside hydrolases. Phylogenetic analysis based on an amino acid sequence alignment of the catalytic domain of class I α-mannosidases reveals four well-supported phylogenetic groups within this family. These groups include a number of paralogous members generated by gene duplications that occurred as far back as the initial divergence of the crown-group of eukaryotes. Three of the four phylogenetic groups consist of enzymes that have group-specific biochemical specificity and sites of activity. An attempt was made to uncover the role that natural selection played in the sequence and structural divergence between the phylogenetically and functionally distinct endoplasmic reticulum (ER) and Golgi apparatus groups. Comparison of site-specific amino acid variability profiles for the ER and Golgi groups revealed statistically significant evidence for functional diversification at the sequence level and indicated a number of residues that are most likely to have played a role in the functional divergence between the two groups. The majority of these sites appear to contain residues that have been fixed within one organelle-specific group by positive selection. Somewhat surprisingly, these selected residues map to the periphery of the α-mannosidase catalytic domain tertiary structure. Changes in these peripherally located residues would not seem to have a gross effect on protein function. Thus diversifying selection between the two groups may have acted in a gradual manner consistent with the Darwinian model of natural selection.
Prediction of N-Glycosylation Sites in Proteins
Ramneek Gupta(1), Eva Jung(2) and Soren Brunak(1)
(1)Center for Biological Sequence Analysis, Bldg-208,
Technical University of Denmark, Lyngby, DENMARK and
(2)Swiss Institute of Bioinformatics, Geneva, SWITZERLAND
Contrary to widespread belief, acceptor sites for N-linked glycosylation on protein sequences are not well characterised. The consensus sequence, Asn-Xaa-Ser/Thr (where Xaa is not Pro), is known to be a prerequisite for the modification. However, not all of these sequons are modified, and the consensus alone therefore does not discriminate between glycosylated and non-glycosylated asparagines. We train artificial neural networks on the surrounding sequence context, in an attempt to discriminate between acceptor and non-acceptor sequons. In a cross-validated performance, the networks could identify 86% of the glycosylated and 61% of the non-glycosylated sequons, with an overall accuracy of 76%. The method can be optimised for high specificity or high sensitivity. Apart from characterising individual proteins, the prediction method can rapidly scan complete proteomes.
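The consensus rule itself is easy to express in code. The sketch below only locates candidate sequons (Asn-Xaa-Ser/Thr with Xaa not Pro) in a one-letter protein sequence; it is not the neural network predictor described here and says nothing about which candidates are actually glycosylated. The function name and the test peptide are invented for illustration.

```python
import re

# Lookahead so that overlapping sequons (e.g. N-N-T-S) are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")

def find_sequons(protein):
    """Return 1-based positions of Asn residues in Asn-Xaa-Ser/Thr, Xaa != Pro."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]

# Invented peptide: N2 (NGT) and N9, N10 (NNT, NTS) match; N5 (NPS) does not.
peptide = "MNGTNPSANNTS"
positions = find_sequons(peptide)
print(positions)
```

A scan like this yields the candidate sites on which a context-based classifier would then be applied, since the consensus is necessary but not sufficient for the modification.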
Glycosylation is an important post-translational modification, and is known to influence protein folding, localisation and trafficking, protein solubility, antigenicity, biological activity and half-life, as well as cell-cell interactions. We investigate the spread of known and predicted N-glycosylation sites across functional categories of the human proteome.
An N-glycosylation site predictor for human proteins shall be made available at
www.cbs.dtu.dk/services/NetNGlyc
Prediction of Gene Function within a Family of Related Proteins: A Case Study of the Xanthine Oxidase Family
Nikolai V. Ivanov(1,2) and Dale E. Edmondson(2)
(1)Department of Chemistry and (2)Department of Biochemistry,
Emory University, Atlanta, GA 30322 USA
Correct gene assignment for the genomes sequenced to date is being addressed using numerous methods. Most of the effort is directed toward creating a complete library of families and superfamilies for all known genes and proteins. A finer problem remains, however: predicting the function of a gene assigned to a particular superfamily. In this work we present a method for further classifying proteins within the xanthine oxidase family. The method is based on analysis of multiple alignments for characterized proteins of known function as well as site-directed mutagenesis, kinetic and crystallographic data. The multiple alignment data help to locate the conserved residues of interest within superfamily genes, and the mutagenesis, kinetic, and crystallographic data provide information on the importance of the conserved residues. Each residue is classified as having structural significance or as being involved in cofactor or substrate/ligand binding. Those residues are compared between the enzymes of the xanthine oxidase family based on several characteristics: substrate preference, nature of cofactor, and structural variations. A score is attributed to a gene of unknown function to account for the presence of residues characteristic of a particular function or binding site. To test this method we took several genes of known function and constructed the knowledge set from the remaining characterized proteins of the xanthine oxidase family. Our poster presents the successes and pitfalls of our prediction. Being able to predict function based on the gene sequence is very important for correct assignment of newly sequenced genes, as well as for prediction and interpretation of the results of site-directed mutagenesis of fairly well studied proteins.
Genomic Scale Relative Rates Test and the Detection of Functional Diversification among Bacterial, Archaeal and Eukaryotic Proteins
I. King Jordan, Fyodor A. Kondrashov, Igor B. Rogozin, Roman L. Tatusov, Yuri I. Wolf and Eugene V. Koonin
National Center for Biotechnology Information, National Library of Medicine,
National Institutes of Health, Bethesda, Maryland 20894, USA
Detection of changes in a protein's evolutionary rate may reveal cases of
change in that protein's function. We developed and implemented a simple
relative rates test in an attempt to assess the rate constancy of protein
evolution and to detect cases of functional diversification between
orthologous proteins. The test was performed on clusters of orthologous
protein sequences from complete bacterial genomes (Chlamydia trachomatis,
C. muridarum, and Chlamydophila pneumoniae), complete archaeal genomes
(Pyrococcus horikoshii, P. abyssi, and P. furiosus) and
partially sequenced mammalian genomes (human, mouse, and rat). Amino acid
sequence evolution rates are significantly correlated on different branches
of phylogenetic trees representing the great majority of analyzed orthologous
protein sets from all three domains of life. However, approximately 1% of
the proteins from each group of species deviate from this pattern and
instead show variation that is consistent with an acceleration of the
rate of amino acid substitution, which may be due to functional diversification.
Most of the putative functionally diversified proteins from all three species
groups are predicted to function at the periphery of the cells and mediate
their interaction with the environment. Relative rates of protein evolution
are remarkably constant for the three species groups analyzed here.
Deviations from this rate constancy are probably due to changes in selective
constraints associated with diversification between orthologs. Functional
diversification between orthologs is thought to be a relatively rare event.
However, the resolution afforded by the test designed specifically for
genomic scale data sets allowed us to identify numerous cases of possible
functional diversification between orthologous proteins.
PlasmoDB: An Example of Using GUS and RAD to Build a Database for Malaria Researchers that Combines Mapping, Sequence and Expression Data.
Jessica C. Kissinger(1), Brian Brunk(2), Jonathan Crabtree(2),
Sharon J. Diskin(2), Martin J. Fraunholz(1), Gregory R. Grant(2),
Dinesh Gupta(1), Shannon McWeeney(1), Arthur J. Milgram(1),
David S. Roos(1), Jonathan Schug(2), and Christian J. Stoeckert Jr.(2)
(1)Department of Biology and (2)Center for Bioinformatics,
University of Pennsylvania, Philadelphia, PA 19104 USA
PlasmoDB (PlasmoDB.org) is the official database of the Plasmodium falciparum genome sequencing consortium. The relational schemas used to build PlasmoDB (GUS, Genomics Unified Schema and RAD, RNA Abundance Database) employ a highly structured format to accommodate the diverse data types generated by sequence and expression projects. PlasmoDB currently houses sequence information (both finished and unfinished) from five Plasmodium species, and provides tools for cross-species comparisons. Sequence information is integrated with other genomic-scale data emerging from the Plasmodium research community, including gene expression analysis from EST, SAGE, and microarray projects. A variety of tools allow researchers to formulate complex, biologically based queries of the database. A version of the database is also available via CD-ROM (Plasmodium GenePlot), facilitating access to the data in situations where internet access is difficult (e.g. by malaria researchers working in the field). The goal of PlasmoDB is to enhance utilization of the vast quantities of data emerging from genome-scale projects by the global malaria research community.
An Analysis of Gene-Finding Approaches for Neurospora crassa
Eileen Kraemer(1), Jian Wang(1,2), Jinhua Guo(1), Samuel Hopkins(1),
Jonathan Arnold(2)
(1)Computer Science Department and (2)Genetics Department,
The University of Georgia, Athens, GA 30602 USA
Motivation: Computational gene identification plays an important
role in genome projects. The approaches used in gene identification
programs are often tuned to one particular organism, and accuracy for
one organism or class of organism does not necessarily translate to
accurate predictions for other organisms. We evaluated five computer
programs on their ability to locate coding regions and to predict gene
structure in Neurospora crassa. One of these programs (FFG) was
designed specifically for gene-finding in Neurospora crassa, but
the model parameters have not yet been fully "tuned", and the program
should thus be viewed as an initial prototype. The other four programs
were neither designed nor tuned for N. crassa.
We evaluated five programs (GenScan, HMMGene, GeneMark, Pombe and FFG)
on data sets from the University of Mexico, the University of Georgia,
and from the PEDANT database at MIPS (Munich Information Center for
Protein Sequences). Our results show that overall the GenScan program
has the best performance on sensitivity and ME (missing exons), while
the HMMGene and FFG programs perform well at locating the approximate
positions of exons. However, the reader is cautioned as to the reliability
of the annotated data sets, as GenScan was used in the annotation
of some sequences.
The importance of evaluating programs based on the particular organism
one wishes to study is clear. Most of the gene-finding programs
evaluated are inappropriate for finding genes in N. crassa.
Additional work motivated by this study includes the creation
of a tool for the automated and rapid evaluation of gene-finding
programs, the collection of larger and more reliable data sets for
N. crassa, parameterization of the model used in FFG to produce a more accurate gene-finding program for this species, and a more in-depth evaluation of the reasons that existing programs generally fail for N. crassa.
Links to the programs, data sets, and results may be found at:
jerry.cs.uga.edu/~wang/genefind.html
Automatic Rule Generation for Protein Annotation with the C4.5 Data Mining Algorithm Applied on SWISS-PROT
Ernst Kretschmann
European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SA, UK
The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools without the use of automated annotation systems is not able to keep up with the ever increasing quantity of data that is submitted. Automated supplements to manually curated databases such as TrEMBL or GenPept cover raw data but provide only limited annotation. To improve this situation automatic tools are needed that support manual annotation, automatically increase the amount of reliable information and help to detect inconsistencies in manually generated annotations.
The standard data mining algorithm C4.5 was successfully applied to gain knowledge about the Keyword annotation in SWISS-PROT. 11306 rules were generated, which are provided in a database and can be applied to as yet unannotated protein sequences and viewed using a web browser. They rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. The statistical evaluation of the generated rules by cross-validation suggests that by applying them to arbitrary proteins, 33% of their keyword annotation can be generated with an error rate of 1.5%. The coverage rate of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%.
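C4.5 selects decision-tree splits using the gain ratio criterion, that is, information gain normalised by the entropy of the split itself. The following Python sketch illustrates that criterion only; it is not the rule-generation pipeline of this abstract, and the attribute values ("hit"/"miss" for a signature match) and keyword labels in the usage line are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(attribute_values, labels):
    """Information gain of splitting on an attribute, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: a signature match that perfectly predicts a keyword.
ratio = gain_ratio(["hit", "hit", "miss", "miss"],
                   ["Kinase", "Kinase", "none", "none"])
```

An attribute that takes a single value for all proteins has zero split information and is given a ratio of 0 here, so it is never preferred.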
The results of the automatic data mining process can be browsed on
www.ebi.ac.uk/spearmint
The source code is available upon request.
LumberJack Generates a Forest of Trees by Jackknifing Alignments
Carolyn J. Lawrence (1), R. Kelly Dawe(1,2), and Russell L. Malmberg(2)
(1)Department of Botany and (2)Department of Genetics, University of Georgia, Athens, Georgia USA
Phylogenomics is a method of sequence-based function prediction by phylogenetic analysis (Eisen 1998). The phylogenomic method often yields more accurate functional hypotheses than techniques based solely upon sequence similarity (such as BLAST). It is implemented by constructing a reasonable phylogenetic tree for a given dataset, then mapping the functions of experimentally analyzed proteins onto the tree. While one might prefer to build such trees using ML-based algorithms, most implementations are computationally impractical for large datasets (especially those consisting of protein sequences). We are developing an ML heuristic search tool that we call LumberJack. LumberJack progressively jackknifes an alignment to generate multiple neighbor joining trees, then compares those trees statistically on the basis of their relative likelihood scores. Trees built after misleading blocks within the alignment have been removed have a better likelihood score than those built using the entire alignment, and the likelihood score is worse for trees built after phylogenetically informative blocks of the alignment have been removed. Thus not only does LumberJack quickly generate a distribution of phylogenetic trees for further analyses, it also maps phylogenetic information onto the alignment. Using the kinesin dataset of Lawrence et al. (2001), we find that the most likely tree discovered using the LumberJack protocol is similar to the Lawrence et al. tree (built using ML star decomposition and placing individual sequences by hand). LumberJack also revealed a region of the alignment that misled both parsimony heuristics and neighbor joining treebuilding.
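The jackknifing step can be sketched as follows in Python. This is only an illustration under stated assumptions: the abstract does not describe how LumberJack chooses its blocks, so fixed-size contiguous column blocks are assumed here, and the function name jackknife_alignments is hypothetical. Tree building and likelihood scoring are left to external programs and are not shown.

```python
def jackknife_alignments(alignment, block_size):
    """Yield (start, end, reduced_alignment) with columns [start, end) removed.

    `alignment` maps sequence names to equal-length aligned strings.
    Assumption: blocks are contiguous and of fixed size (the last may be shorter).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    width = lengths.pop()
    for start in range(0, width, block_size):
        end = min(start + block_size, width)
        reduced = {name: seq[:start] + seq[end:] for name, seq in alignment.items()}
        yield start, end, reduced

# Hypothetical toy alignment; each reduced alignment would be passed on to a
# tree-building program and the resulting trees compared by likelihood.
toy = {"seqA": "MKT-LLV", "seqB": "MRTALLV"}
replicates = list(jackknife_alignments(toy, 3))
```

Recording the removed column range alongside each replicate is what allows a change in likelihood to be mapped back onto a region of the alignment.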
mRNA Segment Scores of Neurospora crassa Genes Decrease Following Intron Splicing
By Tong Lee(1), April C. Ashford(2), Kaee N. Ross(2), Giovanni Carter(2),
LaTreace Harris(2), and William Seffens(2)
(1)Department of Mathematics, Georgia State University, Atlanta, GA, USA and (2)Department of Biological Sciences, Clark Atlanta University, Atlanta 30314 USA
Free energies of folded mRNAs are usually more negative than those of mononucleotide shuffled sequences (Seffens and Digby, 1999). A segment score is the difference between the folding free energy of an mRNA and the mean free energy of folded shuffled sequences, divided by the standard deviation of the shuffled set. Thirteen genes from Neurospora crassa with introns were studied. mRNAs with introns were found to have segment scores that were more positive (less stable) compared to processed mRNAs. This suggests that intron splicing yields mRNAs that possess more secondary structure than expected compared to mononucleotide shuffled sequences.
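The segment score defined above is a z-score, and can be sketched in Python as follows. The folding free energies are assumed to come from an external folding program and are simply passed in as numbers; the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not say which estimator was used.

```python
import random
from statistics import mean, stdev

def mononucleotide_shuffle(sequence, rng):
    """Return a shuffled copy of the sequence that preserves base composition."""
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)

def segment_score(native_energy, shuffled_energies):
    """(native energy - mean of shuffled energies) / std. dev. of shuffled energies.

    Assumption: sample standard deviation (statistics.stdev).
    """
    return (native_energy - mean(shuffled_energies)) / stdev(shuffled_energies)

# Hypothetical energies in kcal/mol, for illustration only.
score = segment_score(-12.0, [-10.0, -9.0, -11.0])
```

A negative score means the native sequence folds more stably than its shuffled counterparts.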
Intron splicing is a phenomenon found in eukaryotic genes. Eukaryotic mRNAs are produced by posttranscriptional modification of large RNA precursors, hnRNA (pre-mRNA), which are acted upon by snRNPs that form a vital part of the spliceosome that processes mRNA. Free energy is released as RNA structures are formed, creating a more stable structure. The activity of RNA is determined by its structure, the way it is folded back on itself; and cases have been described where the secondary structure plays an important role in gene regulation (for instance the trp operon in E. coli). Thirteen N. crassa mRNA sequences were selected from the GenBank database and analyzed using the GCG Wisconsin package version 10.2-Unix (Oxford Molecular Co.). N. crassa mRNA sequences less than 1200 bases long were randomly selected; each had an identifiable start site, a termination signal, and introns. The thirteen mRNA sequences examined in N. crassa that possess introns were found to have an average segment score of –0.074. After intron splicing, the average calculated segment score is –0.276.
Neurospora is typical of many eukaryotes, in that more than 50% of its genes have introns. The significant difference in the segment scores between the hnRNA and the processed mRNAs is a novel observation. These results may be generally applicable to other organisms. For randomly selected genes from a variety of other organisms the segment score was found to be –1.23 (Seffens and Digby, 1999). However, this observation was for a set of genes that did not possess introns. The decrease in segment scores indicates that processed mRNA has more secondary structure than hnRNA.
This work was performed during an undergraduate summer research experience sponsored by the University of Georgia at Athens
with Dr. Jonathan Arnold (Genetics Department).
References
Seffens, W. and Digby, D. (1999) "mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences" Nuc.Acids Res. 27:1578-1584.
Rule and Dictionary-Based Text Mining to Cross-Reference Genes and Proteins to Relevant Journal Articles, and Clustering of Microarray Results by their Medline Associations Using the CELL Platform
Julie Leonard(1,2), Toby Segaran(1), Hong Dang(1), Jeff Colombe(1),
Jennifer Pan(1), Josh Levy(1)
(1)Incellico, Inc., 2327 Englert Drive, Durham, NC, 27713, USA and
(2)Bioinformatics Department, North Carolina State University, Raleigh, NC, USA
In order to obtain information about a biological entity of interest, such as a gene or a protein, biologists frequently perform a literature search using Medline or PubMed. In many cases, biological entities have more than one name or alias, requiring multiple literature searches. Consequently, finding all relevant literature references for a given list of biological entities can be very labor-intensive. In order to streamline this process, we have used both rule- and dictionary-based text mining methods to cross-reference biological entities with relevant articles in the scientific literature. The dictionary was constructed from gene and protein symbols in HUGO, OMIM, TAIR and LocusLink, while the rules for finding symbols in Medline abstracts were empirically determined using a training set of Medline entries. Included are rules that only take genetics-related abstracts, use organism-specific dictionaries, and add dictionary terms during analysis. We have cross-referenced the complete set of Medline abstracts with biological entities from human, mouse, drosophila, zebrafish, arabidopsis and rat. The result is a rich network of searchable cross-references between biological entities and literature entries, much more comprehensive than the set of literature cross-references contained in Genbank and Swissprot. Links between biological entities and literature references were given a score based on a statistical model for each biological symbol’s relevance to each abstract. The model was based on the comparison of the occurrence of significant words in an abstract above the Medline background set of occurrences with other abstracts containing that symbol. Using our methods, we obtained a precision rate of 89.9% and a recall rate of 91.5% using a small annotated test set derived from Medline.
We also demonstrate the utility of such a rich network of cross-references by showing the results of clustering gene expression data by both expression and co-occurrence of those genes in Medline abstracts using the CELL platform.
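The dictionary-based part of such a cross-referencing step can be sketched in Python as below. This is a minimal illustration, not the CELL platform or the authors' rule set: the function names, the tiny alias dictionary, and the example sentence are all assumptions, and matching is case-sensitive whole-word matching only.

```python
import re

def build_matcher(alias_to_gene):
    """Compile one regex that matches any alias as a whole word.

    Longer aliases are tried first so that a short alias never pre-empts
    a longer one starting at the same position.
    """
    aliases = sorted(alias_to_gene, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(a) for a in aliases) + r")\b"
    return re.compile(pattern)

def genes_in_abstract(text, alias_to_gene, matcher):
    """Return the set of canonical gene symbols whose aliases occur in the text."""
    return {alias_to_gene[m.group(0)] for m in matcher.finditer(text)}

# Hypothetical miniature dictionary mapping aliases to one canonical symbol.
aliases = {"TP53": "TP53", "p53": "TP53", "BRCA1": "BRCA1"}
matcher = build_matcher(aliases)
found = genes_in_abstract("Loss of p53 and BRCA1 function; TP53BP1 is unrelated.",
                          aliases, matcher)
```

The word-boundary anchors keep "TP53" from matching inside the longer symbol "TP53BP1", which is the kind of false positive that additional rules would otherwise need to filter.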
Application of Error-Driven Learning to Biologically Significant
Patterns in Protein Sequences
Sergei Levin and Birgit H. Satir.
Department of Anatomy and Structural Biology, Albert Einstein College of Medicine, Bronx, NY 10064 USA
Biologically significant protein sequence patterns, especially the ones responsible for post-translational modification and signaling, are sometimes highly variable and difficult to pinpoint with the naked eye. Thus, automated acquisition of correct consensus sequences from protein sequences is a very important task. We used error-driven learning to acquire protein consensus sequences for N- and O-glycosylation from pre-annotated protein sequences. The error-driven learning starts with learning of the base pattern, an amino-acid pattern that is present in all sequences with positive annotation. In the case of glycosylation, it is a sequence always present at the glycosylation site. Next, the default value for the base pattern is learned, which implies that the base pattern can either be specific to glycosylation (most frequently met at glycosylation sites) or non-specific (most frequently met at non-glycosylation sites). Subsequently, the alterations to the base pattern that lead to non-default assignments are learned via an iterative application of the base pattern and analysis of the cases where the assumption of the default value was incorrect. These are the patterns that specifically lead to glycosylation and are learned based on the errors made while applying the base pattern. Application of the error-driven learning to the protein sequences for proteins with N- and O-glycosylation revealed that the base pattern is very simple and has a non-specific default value (no glycosylation). However, upon error-driven learning of the modified patterns, a high number of amino-acid patterns with confidence levels of 0.8 and higher have been obtained. Combined with other methods, this technique holds potential for automated pattern acquisition from biological sequences.
Strategies for Improving Multiple Alignment of Retrotransposon Sequences
Renyi Liu and Eileen Kraemer
Department of Computer Science, University of Georgia, Athens, GA 30602 USA
Multiple sequence alignment plays a crucial role in extracting structural, functional, and evolutionary information from the exponentially growing sequence data produced by ongoing genome sequencing. Although there are a number of multiple sequence alignment algorithms and programs available, biologists often find it difficult or time consuming to choose the appropriate algorithm and to refine the resulting alignment interactively.
In this work, we first conducted a comparative study of three alignment programs, DIALIGN, CLUSTALW, and PRRN, which are representatives of local, progressive, and iterative programs, respectively. Entropy was used as the alignment quality indicator. It was shown that the performance of CLUSTALW and PRRN was similar, and better than that of DIALIGN. We then experimented with some strategies to improve alignment quality, such as realigning certain sequences or sequence ranges with different programs or parameters and hand editing, with the alignment of some retrotransposon sequences as a case study. A graphical tool, named AlignAgain, was built to display alignments, evaluate alignment quality, and improve resulting alignments. AlignAgain is written in Java and allows users to realign whole or partial sequences either with different programs such as CLUSTALW and PRRN or with the same program but different parameters, conduct alignments locally or remotely, edit alignments by inserting or deleting gap letters, and append sequences with profile alignment.
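An entropy-based quality indicator of the kind mentioned above can be sketched in Python as the mean Shannon entropy over alignment columns. The abstract does not give the exact formula used, so this is an assumption: gaps are treated as an ordinary symbol, all columns are weighted equally, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (bits) of the symbols in one alignment column."""
    total = len(column)
    return -sum((n / total) * log2(n / total) for n in Counter(column).values())

def mean_alignment_entropy(rows):
    """Average column entropy; lower values indicate more conserved columns.

    `rows` is a list of equal-length aligned sequences.
    Assumption: the gap character '-' counts as a symbol like any other.
    """
    columns = list(zip(*rows))
    return sum(column_entropy(c) for c in columns) / len(columns)

# Hypothetical toy alignment: one fully conserved column, one split column.
quality = mean_alignment_entropy(["AC", "AG"])
```

Comparing this value for the same sequences aligned by different programs, or before and after realigning a range, gives a single number to track during refinement.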
Detailed results of the comparison study and links to AlignAgain may be found at:
jerry.cs.uga.edu/~renyi
Refining Function Prediction by Analyzing Site Specific Amino Acid Conservation: A Case of PAS Domain-Containing Chemoreceptors
Qinhong Ma(1), Barry L. Taylor(1) and Igor B. Zhulin(2)
(1)Department of Microbiology and Molecular Genetics, School of Medicine,
Loma Linda University, Loma Linda, CA 92350 USA and (2)School of Biology,
Georgia Institute of Technology, Atlanta, GA 30332-0230 USA
One of the major goals of comparative genomics is to predict a biological
function for proteins by using a variety of bioinformatics tools. Function
prediction for some classes of proteins is difficult due to their mosaic
structure and the presence of highly variable domains. Even when all domains
can be detected in a given protein, prediction of exact biological function
might be a challenge.
PAS domains are sensory elements in various classes of signal transduction
proteins in organisms ranging from Bacteria and Archaea to humans. PAS domains
are implicated in sensing oxygen, redox potential, light and small ligands inside
a living cell. The Aer chemoreceptor of Escherichia coli is a model PAS domain-containing sensor, which governs bacterial motility in response to changes in the redox potential. The protein sequence of Aer_Ecoli contains an N-terminal PAS domain and a C-terminal chemoreceptor domain (MA domain in SMART database). Using BLAST searches of microbial databases, we have identified 55 apparent homologs of Aer that have N-terminal PAS domain(s) and a C-terminal MA domain. Phylogenetic analysis revealed that all PAS-containing receptors belong to several distinct classes. Proteins from the first class all have a single PAS domain, where amino acids are conserved in specific positions crucial for FAD binding and signaling by Aer, as revealed by multiple sequence alignments and mapping on known 3D structures. Therefore, they all are predicted to be sensors of redox potential that utilize the signaling mechanism of Aer_Ecoli. Proteins from other phylogenetic clusters have one to three repeats of the PAS domain; however, most residues essential for FAD, FMN or heme binding and signaling are not conserved within their PAS domains. This rules out the possibility that these receptors are sensors of redox potential or oxygen. Our results demonstrate that similarity searches and analysis of the domain architecture are not sufficient for accurate prediction of biological function for signal transduction proteins. Analysis of site-specific conservation of amino acids known to be essential for the function is one of the approaches to improve in silico predictions.
A DNA Repair System Specific for Thermophilic Archaea and Bacteria Predicted by Genomic Context Analysis
Kira S. Makarova (1,2), L. Aravind(1), Nick V. Grishin(3), Igor B. Rogozin(1),
Eugene V. Koonin(1)
(1)National Center for Biotechnology Information, National Library of
Medicine, National Institutes of Health, Bethesda, MD 20894, USA;
(2)Department of Pathology, F.E. Hebert School of Medicine, Uniformed
Services University of the Health Sciences, Bethesda, MD 20814-4799 USA;
(3)Howard Hughes Medical Institute and Department of Biochemistry, University
of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9050, USA
During a systematic analysis of conserved gene context in prokaryotic genomes, a previously undetected, complex, partially conserved neighborhood consisting of more than 20 genes was discovered in most archaea (with the exception of Thermoplasma acidophilum and Halobacterium) and some bacteria, including the hyperthermophiles Thermotoga maritima and Aquifex aeolicus. The gene composition and gene order in this neighborhood vary greatly between species, but all versions have a stable, conserved core that consists of five genes. One of the core genes encodes a predicted DNA helicase, often fused to a predicted HD-superfamily hydrolase, and another encodes a RecB-family exonuclease; three core genes remain uncharacterized, but one of these might encode a nuclease of a new family. Two more genes that belong to this neighborhood and are present in most of the genomes in which the neighborhood was detected encode, respectively, a predicted HD-superfamily hydrolase (possibly a nuclease) of a distinct family and a predicted, novel DNA polymerase. Another characteristic feature of this neighborhood is the expansion of a superfamily of paralogous, uncharacterized proteins, which are encoded by at least 20-30% of the genes in the neighborhood. The functional features of the proteins encoded in this neighborhood suggest that it encodes a previously undetected DNA repair system, which, to our knowledge, is the first repair system largely specific for thermophiles to be identified. This hypothetical repair system might be functionally analogous to the bacterial-eukaryotic system of translesion, mutagenic repair whose central components are DNA polymerases of the UmuC-DinB-Rad30-Rev1 superfamily, which typically are missing in thermophiles.
Predicting Class II MHC/Peptide Multi-Level Binding with an Iterative Stepwise Discriminant Analysis Meta-Algorithm
Ronna R. Mallios
Office of Sponsored Projects and Research, University of California, San Francisco, 2615 East Clinton Avenue, Fresno, CA, 93703 USA
The initial immune response to an extra-cellular pathogen begins with the capture of the pathogen by a macrophage, dendritic cell, or B lymphocyte. In the cell's interior, the protein portion of the pathogen is degraded into peptide fragments. Class II major histocompatibility complex (MHC) molecules bind to areas of the peptide fragments that are designated agretopes. The agretope/MHC complex travels to the cell surface where the class II MHC molecule displays the fragment to nearby CD4 T lymphocytes. When a CD4 T lymphocyte binds to the exposed area of the peptide fragment, designated an epitope, an immune response is initiated.
Each binding peptide fragment is comprised of a linear arrangement of amino acid residues. Knowledge of the amino acid sequence of an agretope is useful in vaccine development and immunotherapy. A motif or quantitative model that recognizes agretopes can be used to screen large numbers of potential binding peptides, reducing laboratory time and costs.
Previous efforts have developed algorithms that successfully separate binding peptides from non-binding peptides for various HLA-DR molecules. The problem of classifying peptides into three or more categories of binding affinity is much more difficult than the dichotomous problem. A large part of the difficulty is due to the fact that the binding affinities found in public databases are produced by a variety of experimental methods. As such, a peptide reported as a high-binder by one method might be classified as a moderate-binder by another method.
This study explores expansion of a dichotomous iterative Stepwise Discriminant Analysis (SDA) meta-algorithm to the general multi-level problem. It seeks to ascertain if the algorithm is relevant and if so, how it compares with other approaches.
HLA-DR1 was selected as the class II MHC molecule of investigation. A database of peptides classified as high binding, moderate binding or non-binding was assembled from the MHCPEP internet database and the published literature. In accordance with published literature, agretopes of length 9 were selected as the units of investigation.
The general algorithm is as follows:
Initialization: (1.) A permanent non-binding dataset is created by entering every subsequence of length 9 from each non-binding peptide. (2.) An initial binding dataset is created by entering every subsequence of length 9 from each binding peptide. (3.) An initial application of SDA produces one classification function for each of the three binding levels.
Repeat until Convergence is Reached: (1.) Create a new binding dataset utilizing the current classification functions. Select from each binding peptide the subsequence that scores the highest according to the appropriate classification function. (2.) Apply SDA to produce new classification functions for each of the three binding levels.
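The two data-handling steps of the algorithm above, enumerating every length-9 subsequence and then selecting the highest-scoring one per peptide, can be sketched in Python as follows. Stepwise discriminant analysis itself is not implemented here; toy_score is a hypothetical stand-in for a fitted classification function, and all function names and the example peptide are assumptions.

```python
def subsequences(peptide, length=9):
    """Every contiguous window of the given length (used in initialization)."""
    return [peptide[i:i + length] for i in range(len(peptide) - length + 1)]

def best_subsequence(peptide, score, length=9):
    """Window scoring highest under the current classification function.

    Raises ValueError if the peptide is shorter than `length`.
    """
    return max(subsequences(peptide, length), key=score)

def toy_score(window):
    # Hypothetical stand-in for an SDA classification function:
    # simply counts hydrophobic residues. Not the model from the abstract.
    return sum(window.count(aa) for aa in "AVLIMFWY")

# Hypothetical peptide, for illustration only.
candidate_agretope = best_subsequence("GGGGGGGGGLLLL", toy_score)
```

In the iterative loop, the selected windows from all binding peptides would form the new binding dataset, the classification functions would be refitted, and the selection repeated until the chosen windows stop changing.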
The resulting model correctly classifies over 85% of the peptides in the database. The HLA-DR1 multi-level binding motif is in agreement with other studies and the level of accuracy is competitive. The results suggest that moderate-binders follow a different pattern from high-binders.
A similar study using regression analysis can corroborate or challenge this conclusion. Regression analysis, however, requires standardized reliable measurements of binding affinity. A well maintained website specializing in standardized binding affinities for peptide/HLA-DR complexes (including non-binders) would expedite the investigation of this problem.
Prediction of the Transmembrane Regions of Beta Barrel Membrane Proteins
with a Predictor Based on HMM and Neural Networks
P.L.Martelli(1), A.Krogh(2) and R.Casadio(1,3)
(1)Laboratory of Biocomputing, Centro Interdipartimentale per le Ricerche
Biotecnologiche (CIRB), Bologna, ITALY;
(2)Centre for Biological Sequence Analysis, the Technical University of Denmark,
Lyngby, DENMARK;
(3)Laboratory of Biophysics, Department of Biology, University of Bologna, Bologna, ITALY
Beta-barrel membrane proteins are inserted in the outer membranes of bacteria,
mitochondria and chloroplasts by means of antiparallel beta strands[1]. The
prediction of the structure of these proteins consists in the prediction of
the position of beta-strands along the sequence. A method based on neural
networks is trained and tested on a non-redundant set of beta-barrel membrane
proteins known at atomic resolution with a jack-knife procedure [2]. This
method predicts the topography of transmembrane beta strands with residue
accuracy as high as 78 % when evolutionary information is used as input to
the network. The neural network results are improved with a post-processing
procedure based on Hidden Markov Models (HMM). The new algorithms we developed
make it possible to train HMMs on the basis of neural network outputs and to
perform predictions that include the typical topological constraints of this
class of proteins (e.g. segment lengths, even number of beta-strands). HMMs
based on evolutionary information can be trained by means of similar algorithms.
After a jack-knife procedure, the predictor assigns:
- the correct structure to 80 % of the residues;
- the correct position to 95 % of the 158 beta-strands included in the training set;
- the correct number of beta-strands along the sequence for 10 out of the 11 examples of the training set.
We propose this as a general method to fill the gap in the prediction of the
structure of beta-barrel membrane proteins. Furthermore, the HMM based on
evolutionary information can filter beta-barrel membrane proteins out from a
set containing globular and all-alpha membrane proteins.
References
1. Schulz, GE (2000) "Beta-barrel membrane proteins."
Curr Op Struct Biol
10: 443-447.
2. Jacoboni I, Martelli PL, Fariselli P, De Pinto V and Casadio R (2001)
"Prediction of the transmembrane regions of beta-barrel membrane proteins
with a neural network-based."
Protein Sci 10:779-787
3. Durbin R, Eddy S, Krogh A, Mitchison G (1998) "Biological sequence
analysis: probabilistic models of proteins and nucleic acids."
Cambridge Univ Press, Cambridge.
Automated Annotation of Viral Genomes
Ryan Mills, John Besemer, Alex Lomsadze and Mark Borodovsky
School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30332-0230 USA
The GenBank database currently contains over 700 viral genomes. The diversity
of this set makes it difficult to find the boundaries of viral genes using a
unified approach. Genome size and the host that the virus infects are two
factors that determine which of the available gene finding algorithms is most
appropriate. The smallest viruses and phages do not contain enough DNA for
training utilizing traditional methods. For these cases, the GeneMark.hmm
program along with heuristically derived models of protein-coding DNA is the
suggested method.
For phage genomes larger than 10 kb, the GeneMarkS program, which utilizes an iterative self-training algorithm, produces accurate results. GeneMarkS makes use of a ribosomal binding site model to aid in the prediction of the starts of genes.
The genomes of viruses that infect eukaryotes can be analyzed with a new modified version of the GeneMarkS program, called GeneMarkS EV. This self-training program builds a model of start codon context along with the protein-coding and non-coding DNA models in each of its iterations.
A database of the predictions made utilizing these approaches will be made available on our web site at opal.biology.gatech.edu/GeneMark.
Comparative Genomics of Two-Component Signal Transduction in
Pseudomonas aeruginosa and Vibrio cholerae
Christophe Mougel and Igor B. Zhulin
School of Biology, Georgia Institute of Technology, 310 Ferst Drive, Atlanta GA 30332-0230 USA
Living organisms monitor the environment and adjust their behavior,
metabolism and development in response to changes in physico-chemical
parameters. Even simple prokaryotic organisms possess sophisticated
signal transduction networks that contain specialized receptors directly
interacting with environmental cues. Two-component regulatory systems are
the main means of signal transduction in Bacteria and are also present in
Archaea, lower eukaryotes and plants.
We performed a comparative genomic analysis of the two-component (histidine
kinase – response regulator) systems of two species of pathogenic
gamma-proteobacteria, P. aeruginosa and V. cholerae, that diverged 7 million
years ago. Sixty-four sensor histidine kinases were identified in
P. aeruginosa, and forty in V. cholerae.
However, there are only ten sensors conserved between the two species. The domain architecture was determined for all sensory proteins, and a correlation was found between the domain architecture and the phylogeny of histidine kinases. Phylogenetic analysis resulted in the identification of several paralogous sensors in both species and allowed us to predict a possible function for response regulators whose genes are not paired with histidine kinases on the chromosome. Comparative genomics of signal transduction provides a useful tool for understanding the biology of these important pathogens.
A New Approach to Sequence Assembly using Divide-and-Conquer Algorithms
Hasan H. Otu and Khalid Sayood
Department of Electrical Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0511 USA
We propose a new algorithm for assembling fragments from a long DNA sequence obtained by shotgun sequencing. The proposed algorithm solves the orientation, overlap and layout phases simultaneously. The fragments are clustered by their Average Mutual Information (AMI) profiles using the k-means algorithm. The AMI profiles of the fragments prove to be a distinctive measure, as fragments that belong to the same region of the target sequence have similar AMI profiles. Moreover, AMI profiles are robust to errors and remain unchanged when calculated for the reverse complement of the fragment. Clustering the fragments removes the unnecessary computational burden of considering the collection of fragments as a whole. Instead, orientation and overlap detection are solved efficiently within the clusters, as each cluster contains a feasible number of fragments coming from the same region of the target sequence. The consensus sequences of the clusters are taken to form a new set of fragments and the basic approach is repeated. This recursion continues until there is one cluster left or until no new cluster is formed. In the first case we have a final consensus sequence for the target; in the second case we end up with a number of contigs, which can be ordered arbitrarily. The simulation results are very promising for both artificial and real data sets. In the case of zero sequencing error, the final consensus sequence is identical to the target sequence using a coverage of five. When the error rate is increased to 5%, using a coverage of five, the algorithm reconstructs the target sequence with over 99% similarity and within 2% of its length. Future work can focus on investigating different methods for clustering (as an alternative to k-means) and different measures of similarity (as an alternative to AMI profiles).
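The reverse-complement property of AMI profiles can be illustrated with a minimal sketch. It assumes that the AMI at lag k is the mutual information between bases k positions apart, with marginals taken from the observed base pairs; the abstract does not give the estimator, the lag range or the k-means settings, so the function names, the example fragment and max_lag below are illustrative assumptions only.

```python
import math
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def mutual_information(seq, lag):
    """Mutual information (in bits) between bases separated by `lag` positions."""
    pairs = Counter(zip(seq, seq[lag:]))  # counts of (base at i, base at i + lag)
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    left, right = Counter(), Counter()
    for (x, y), n in pairs.items():
        left[x] += n    # marginal of the first base in each pair
        right[y] += n   # marginal of the second base in each pair
    mi = 0.0
    for (x, y), n in pairs.items():
        # p(x,y) / (p(x) p(y)) simplifies to n * total / (left[x] * right[y])
        mi += (n / total) * math.log2(n * total / (left[x] * right[y]))
    return mi


def ami_profile(seq, max_lag=8):
    """Vector of mutual information values for lags 1..max_lag (assumed range)."""
    return [mutual_information(seq, k) for k in range(1, max_lag + 1)]


# Hypothetical fragment, used only to compare the two strands.
fragment = "ACGTTGCAAGGCTTACGATCGGATTACAGGCTAACGT"
forward = ami_profile(fragment)
reverse = ami_profile(reverse_complement(fragment))
```

Under this definition the two profiles agree up to floating-point rounding, because reverse complementation only relabels and swaps the two members of each base pair. Such vectors could then be passed to any k-means implementation.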
SUBLIME 1.0: a Java-Based Tool to Automate BLAST Searches and to Comprehensively Identify Homologs of a Query Sequence
Jerry J. Palmer and John M. Logsdon, Jr
Program in Genetics and Molecular Biology and Department of Biology, Emory University, Atlanta, GA 30322 USA
BLAST is a widely used similarity search and alignment tool designed to explore sequence databases for similarities to a given query sequence. To assess whether a given hit, or alignment, constitutes evidence of homology, the strength of the alignment is compared, under a statistical model, to what can be expected from chance alone. A BlastP search, which typically takes 1-10 minutes when done at NCBI, may yield N significant hits for a particular protein query (with significance defined by some threshold expect value). While these hits may represent N proteins that are likely homologous to the initial query, they are often only an incomplete set of all homologous sequences in the database. To obtain a more complete list of homologs, a BlastP search can be carried out against each of the N protein hits. This procedure may result in additional, unique hits at the cost of several man-hours of repetitive work (if done manually); the process of BLASTing each result and adding the results to a master list is repeated until no more new sequences are found. To automate this process, we have created a Java program called SUBLIME, for Search Using BLast Iteratively for Molecular Evolution. This multi-threaded application, which directly queries the NCBI databases, has already proven useful: this previously time-consuming and tedious work can now be accomplished in minutes (often < 15 min.). Upon completion, SUBLIME automatically generates a web page that contains a master list of proteins (putative homologs), the BlastP and TBlastN results for each protein query and a link for each protein to its Entrez record at NCBI. We are currently adding further features to the SUBLIME application and will present the results of some of this ongoing development.
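The iterate-until-no-new-hits procedure described above amounts to a worklist search. The following is a minimal sketch of that idea, not SUBLIME itself (which is a multi-threaded Java program): the search step is passed in as a hypothetical callable, and a toy lookup table stands in for real BlastP runs, so no NCBI interface is assumed.

```python
from collections import deque


def collect_homologs(query, search):
    """Run `search` on the query and on every new hit until nothing new appears.

    `search` is any callable that maps a sequence identifier to an iterable of
    significant hit identifiers (a placeholder for a BlastP run).
    """
    found = {query}            # master list of identifiers seen so far
    pending = deque([query])   # identifiers that still need to be searched
    while pending:
        current = pending.popleft()
        for hit in search(current):
            if hit not in found:
                found.add(hit)
                pending.append(hit)
    return found


# Toy stand-in for search results; the identifiers are invented.
TOY_HITS = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P4"],
    "P3": ["P1"],
    "P4": ["P2", "P5"],
    "P5": ["P4"],
    "P9": ["P9"],
}


def toy_search(protein_id):
    return TOY_HITS.get(protein_id, [])


homologs = collect_homologs("P1", toy_search)
```

Here P4 and P5 are reached only through second- and third-round searches, which is the situation the iterative procedure is meant to cover. Each identifier is searched once, so the loop terminates when the master list stops growing.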
Integrated Genetic Map Service (IGMS)
Harald Pankow, Heike Pospisil, Alexander Herrmann, and Jens G. Reich
Max-Delbrueck-Center for Molecular Medicine, Department of Bioinformatics, Robert-Roessle-Str.10 13092 Berlin-Buch, GERMANY
We present three novel functions implemented in the IGMS in Berlin-Buch.
The IGMS is a comprehensive information system that combines the knowledge
from genomic sequence, genetic map and genetic disorders databases.
This system is updated weekly and focuses on the analysis of EST data.
The first application identifies UniGene clusters that are differentially
expressed in different types of cancer with respect to different reference
tissues, using as criteria, for example, defined ratios of the number of
ESTs found in the tumour tissue to the number found in normal
tissues and a defined number of ESTs per cluster. The results can be combined
with clinical data to assess the potential relevance of specific genes for
patient survival or metastatic spread.
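The selection criteria of the first application can be sketched as a simple filter. This is an illustration under stated assumptions rather than the IGMS implementation: the thresholds, the pseudocount of one and the cluster names are hypothetical, and a real analysis would also need to normalise for the sizes of the EST libraries.

```python
def differentially_expressed(clusters, min_ratio=5.0, min_ests=10):
    """Select clusters whose tumour/normal EST ratio passes a threshold.

    `clusters` maps a cluster name to (tumour_est_count, normal_est_count).
    `min_ratio` and `min_ests` are assumed example values.
    """
    selected = []
    for name, (tumour, normal) in clusters.items():
        if tumour + normal < min_ests:
            continue  # too few ESTs in the cluster to judge
        # Pseudocount of 1 avoids division by zero when no normal ESTs exist.
        ratio = (tumour + 1) / (normal + 1)
        if ratio >= min_ratio:
            selected.append(name)
    return sorted(selected)


# Invented counts for illustration only.
toy_clusters = {
    "cluster_A": (40, 2),
    "cluster_B": (3, 0),
    "cluster_C": (12, 10),
    "cluster_D": (15, 0),
}
candidates = differentially_expressed(toy_clusters)
```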
The second application maps ESTs with a specific expression profile, e.g.
representing genes overexpressed in breast cancer, to the corresponding
regions of the genome and vice versa, e.g. maps all genes on chromosome
8 that are overexpressed in breast cancer.
The third application generates a database of alternative splice forms for eight organisms from EST and mRNA sequence data. The results can be used to find splicing patterns specific for certain tissues or tumour types.
Dealing with Errors in Interactive Sequencing by Hybridization
Vinhthuy Phan and Steven Skiena
Computer Science Department, State University of New York at Stony Brook, Stony Brook NY, 11794-4400 USA
A realistic approach to Sequencing by Hybridization must deal with realistic
sequencing errors. The results of such a method can surely be applied to
similar sequencing tasks.
We provide the first algorithms for interactive sequencing by hybridization which are robust in the presence of hybridization errors. Under a strong error model allowing both positive and negative hybridization errors without repeated queries, we demonstrate accurate and efficient reconstruction with error rates up to 7%, using 11 DNA sequences from GenBank. Under the weaker traditional error model of Shamir and Tsur, RECOMB 2001, p269-277, we obtain accurate reconstructions with up to 20% false negative hybridization errors.
Finally, we establish theoretical bounds on the performance of the sequential
probing algorithm of Skiena and Sundaram, J. Computational Biology,
1995, p333-353, under the strong error model.
Mining SNPs for Associating Disease with Transcription Factor Binding Site Altered by Mutation
Julia Ponomarenko, Tatyana Merkulova, Galya Orlova, Elena Gorshkova,
and Misha Ponomarenko
Institute of Cytology and Genetics, Novosibirsk, 630090, RUSSIA
SNP-induced alterations in conserved codons and splice sites and,
hence, in protein structure-function relationships are easier to explain than
alterations in variable DNA sites that bind transcription factors (TF).
That is why we have developed a system, rSNP_Guide,
wwwmgs.bionet.nsc.ru/mgs/systems/rsnp,
which associates SNP-caused disease with TF sites altered by mutation
[Ponomarenko et al., NAR, 2001, 29, 312-316]. Our system treats two sorts
of experimentally detected alterations: in the DNA sequence and in the
pattern of DNA binding to an unknown TF. By applying rSNP_Guide,
it is possible to predict known TF sites from alterations in the
sequence-dependent recognition score that are consistent with the
experimental alterations in DNA binding to the unknown TF. Our system provides
both brief and in-depth SNP analysis, depending on the number of known TFs
whose sites the user wishes to examine. We have already
tested our system on many genes with experimentally known
TF/disease associations. Among these control data, CETP
(Sp1, dietary cholesterol response); TGM1 (AP-1 and CRE, squamous metaplasia);
factor-IX (Ets, Leyden form of hemophilia B); gpD
(GATA, Duffy blood group Fy{a-b-}); hMPO (Sp1, myelocytic leukemia);
hAG (ER, myocardial infarction); h-delta-G (GATA, delta-thalassemia); and
hRB (Sp1, abnormal tumor suppression) were examined. In addition,
we have tested our system using site-directed mutagenesis data
of both the "multiple substitutions" and "deletion" types. In this case,
the known TF sites damaged artificially in the regulatory regions of the
genes rAT(1A)R-C (MEF-2), hCD4 (Ets and ATF), hTOP3 (YY1 and USF),
AchR-delta (MyoD and E2A), p53 (NFkB), c-myc (NFkB), iNOS (IRF-1)
and hsp70 (HSF) were treated. Finally, two novel TF sites in which SNP-caused
alterations could be associated with diseases were predicted
and then successfully confirmed experimentally. First, GATA in the
second intron of the mK-ras gene causes lung tumor. Second, YY1 absent
in the sixth intron of the hTDO2 gene causes mental disorders. With this
in mind, we hope that our system rSNP_Guide will be applicable to
SNP-related analysis.
Presence of ATG Triplets in 5’ Untranslated Regions of Eukaryotic cDNAs
Correlates with a "Weak" Context of the Start Codon
Igor B. Rogozin(1), Alexey V. Kochetov(2), Fyodor A. Kondrashov(1),
Eugene V. Koonin(1) and Luciano Milanesi(3)
(1)National Center for Biotechnology Information,
National Library of Medicine, National Institutes of Health, Bethesda, MD, USA;
(2)Institute of Cytology and Genetics, 10. Lavrentyev Ave., Novosibirsk 630090, RUSSIA;
(3)Istituto di Tecnologie Biomediche Avanzate, Consiglio Nazionale Delle Ricerche,
via Fratelli Cervi 93, 20090 Segrate (MI), ITALY
The context of the start codon (typically, AUG) and the features of the 5' untranslated
regions (5' UTRs) are important for understanding translation regulation in eukaryotic
mRNAs and for accurate prediction of the coding region in genomic and cDNA sequences.
The presence of AUG triplets in 5'UTRs (upstream AUGs) might affect the initiation rate and,
in the context of gene prediction, could reduce the accuracy of the identification of the
authentic start. To reveal potential connections between the presence of upstream AUGs and
other features of 5'UTRs, such as their length and the start codon context, we undertook a
systematic analysis of the available eukaryotic 5'UTR sequences. We show that a large
fraction of 5'UTRs in the available cDNA sequences, 15-53% depending on the organism,
contain upstream ATGs. A negative correlation was observed between the information content
of the translation start signal and the length of the 5'UTR. Similarly, a negative correlation
exists between the "strength" of the start context and the number of upstream ATGs.
Typically, cDNAs containing long 5'UTRs with multiple upstream ATGs have a "weak"
start context, and in contrast, cDNAs containing short 5'UTRs without ATGs have
"strong" starts. These counter-intuitive results may be interpreted in terms of
upstream AUGs having an important role in the regulation of translation efficiency
by ensuring low basal translation level via double negative control and creating
the potential for additional regulatory mechanisms. One such mechanism, supported
by experimental studies of some mRNAs, includes removal of the AUG-containing portion
of the 5'UTR by alternative splicing. Availability: An ATG_EVALUATOR program is available
upon request from I.B.Rogozin (rogozin@ncbi.nlm.nih.gov)
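Counting upstream ATGs in a cDNA is straightforward once the position of the annotated start codon is known. The sketch below is an illustration only and is not the ATG_EVALUATOR program; the function name and the example sequence are hypothetical, and only triplets lying wholly inside the 5'UTR are counted.

```python
def upstream_atg_positions(cdna, cds_start):
    """Return 0-based positions of ATG triplets located entirely in the 5'UTR.

    `cdna` is the cDNA sequence and `cds_start` the 0-based index of the
    first base of the annotated start codon.
    """
    utr = cdna[:cds_start].upper().replace("U", "T")  # accept RNA or DNA letters
    return [i for i in range(len(utr) - 2) if utr[i:i + 3] == "ATG"]


# Invented example: a 16-nt 5'UTR followed by a short coding region.
example_cdna = "GGATGCCTTATGAACC" + "ATGGCTTAA"
example_cds_start = 16
positions = upstream_atg_positions(example_cdna, example_cds_start)
```

The length of the 5'UTR is simply cds_start, so both quantities discussed in the abstract (UTR length and number of upstream ATGs) can be tabulated per sequence from this function.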
Sequence-Structure Space and Resultant Data Redundancy in the Protein Data Bank
I.N. Shindyalov(1) and P.E. Bourne(2,3)
(1)San Diego Supercomputer Center, University of California San Diego,
9500 Gilman Drive, La Jolla, CA 92093-0537 USA;
(2)Department of Pharmacology, University of California,
San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA;
(3)The Burnham Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037 USA
A study of sequence-structure space and resultant data redundancy has been performed
using the Combinatorial Extension (CE) algorithm for determining structural alignment and
BLAST for determining sequence similarity. Significant clusters in sequence-structure space
associated with recurrent structures (convergent evolution) and protein superfamilies
(divergent evolution) have been described. These observations have been compared to the
SCOP classification of protein domains that defines similar features. Both methods indicate an
enormous redundancy of data in the Protein Data Bank (PDB), and hence a need to define
representative (non-redundant) sets of proteins, especially for use in various computational analyses.
Various representations of the PDB as sequence and/or structure non-redundant sets of protein
chains have been defined, containing from 1200 to 6000 representatives. It was demonstrated that
the commonly used sequence similarity criterion alone is not very efficient in selecting unique proteins.
We demonstrate that sequence or structure based representative sets of single polypeptide chains
contain approximately 20-30% redundancy by complementary (sequence vs. structure) criteria.
We propose here an approach for building representative sets using a combined sequence and structure
similarity criterion with additional conditions requiring adequate representation of proteins excluded
from the set. Representative sets obtained using these various criteria and the correlation
between different sets are analyzed. Representative sets are updated on a weekly basis and
are available from cl.sdsc.edu/nr.html.
Identification and Accurate Modeling of Motifs Specifying Start Codon Locations
in the Genome of an Unusual Cyanobacterium
Mark J Schreiber(1), John D Besemer(2), Mark Borodovsky(2), Chris M Brown(1)
(1)Department of Biochemistry, University of Otago, PO Box 56, Dunedin, New
Zealand and (2)Department of Biology, Georgia Institute of Technology, 310 Ferst
Drive, Atlanta, GA 30332, USA
Using the genome sequence of the cyanobacterium Synechocystis sp. PCC6803 and
publicly available gene loci predictions we identified a previously unobserved
element surrounding the start codon. Notably the Shine-Dalgarno ribosome-binding
site conserved in almost all bacteria was found to be absent. Information Theory
predicts that this element contains sufficient information to allow
discrimination of the start codon by the ribosome.
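The information content of a sequence element is commonly estimated per aligned position as the maximum entropy minus the observed entropy. The sketch below assumes a uniform background over the four bases and applies no small-sample correction; the abstract does not state which estimator the authors used, and the aligned sites shown are invented.

```python
import math
from collections import Counter


def information_content(aligned_sites, alphabet="ACGT"):
    """Bits of information at each column of equal-length aligned sites.

    Assumes a uniform background over `alphabet` (2 bits maximum for DNA)
    and uses raw frequencies without small-sample correction.
    """
    max_bits = math.log2(len(alphabet))
    bits = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        total = len(column)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        bits.append(max_bits - entropy)
    return bits


# Invented alignment: three fully conserved positions and one random position.
toy_sites = ["ATGA", "ATGC", "ATGG", "ATGT"]
per_position = information_content(toy_sites)
total_bits = sum(per_position)
```

The summed value can be compared with the information needed to locate a start site among all candidate positions, roughly log2 of the number of candidate positions per true site, which is one conventional way of asking whether an element carries sufficient information.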
To determine if systematic error in genome annotation had caused this
observation we assessed the accuracy of the start codon predictions using a set
of N-terminally mapped 2D gel spots from the organism. While the accuracy of start
codon predictions was found to be only 75%, we did not believe that this would
seriously bias the analysis. However, to further improve the predictions we
developed a technique of iterative training that could provide start codon
predictions with greater than 95% accuracy using only very small verified
datasets. This technique was evaluated in both Synechocystis and E. coli, showing
no prior dependence on any organism-specific motifs.
The Phosphoproteome Predicted: Using Neural Networks for Predicting Kinase
Substrate Sites
Thomas Sicheritz-Ponten, Nikolaj Blom and Soren Brunak
Center for Biological Sequence Analysis, Technical University of Denmark, Bldg
208, DK-2800 Lyngby, DENMARK
Protein phosphorylation is the primary means of switching the activity of a
cellular protein rapidly from one state to another. Thus, protein
phosphorylation is considered to be a key event in many signal transduction
pathways of biological systems. Phosphorylation of substrate sites at serine,
threonine or tyrosine residues is performed by members of the protein kinase
family. This gene family consists of approximately 860 members and is the second largest
family in the human genome.
We aim to describe the complete predicted phosphoproteome: a description of the
entire collection of phosphoproteins in the eukaryotic cell, the sites of
reversible phosphorylation and the kinase subtype performing the phosphorylation
event. Earlier, we developed a method, NetPhos[1], for predicting the general
probability of a given residue being a potential phosphorylation site or not. In
order to predict the identity of the most probable kinase for each site we have
now developed NetPhosK[2].
To validate our approach, we are using information about evolutionary
conservation from related species. For example, if a specific serine residue is
predicted as a potential PKA site in human protein X and is also predicted to be
a PKA site in the conserved rat and mouse homologs of protein X, we consider
that this adds strong confidence to the prediction. On the other hand, a high
kinase score in combination with a lack of conservation of the acceptor residue
in related species indicates that the site is specific to a given species or that
the site could be phosphorylated in vitro only, lacking a physiological role.
In order to characterize the Human PhosphoProteome we apply the predictor to the
draft genome containing 24819 genes from Ensembl (version 1.0) and present
statistics on potential acceptor sites, overlapping specificities and orphan
protein families. The kinase-specific prediction server will be made publicly
available on the Internet.
References
1. www.cbs.dtu.dk/services/NetPhos
"Sequence- and Structure-Based Prediction of Eukaryotic Protein Phosphorylation
Sites.", Blom, N., Gammeltoft, S., and Brunak, S. (1999), Journal of Molecular
Biology 294(5), 1351-1362
2. "NetPhosK: Prediction of Protein Kinase specificity of eukaryotic
phosphorylation sites", Thomas Sicheritz-Ponten, Nikolaj Blom and Soren Brunak
(manuscript in preparation)
Phylogenomic Atlases for Sequenced Microbial Genomes
T. Sicheritz-Ponten, J.O. Andersson, D. Ussery, A.J. Roger, J. Logsdon, R. Hirt
and T.M. Embley
Center for Biological Sequence Analysis, Technical University of Denmark, Bldg
208, DK-2800 Lyngby, DENMARK
We have developed a method which combines phylogenomic information with DNA
structural parameters. Phylogenetic trees are constructed for each gene in the
genome, using PyPhy, and the results are visualized using DNA atlases. The
original idea of PyPhy has been extended for quick tree-mining of ongoing EST
projects.
Raw sequence data from ongoing EST projects are often incomplete and less
reliable than edited and annotated end-product sequences. In order to facilitate
the automated generation of phylogenetic trees even from partial sequence data,
we transfer the phylogenetic start sequence to a so-called seed, which is by our
definition the first and best match of the partial sequence against the
non-redundant sequence database. The program identifies the first best match
(via blastx) and uses this "seed" sequence to automatically select homologues
which, together with the translated partial sequence (from the blastx result),
are used for the alignment and phylogenetic reconstructions.
In order to facilitate the discovery of "interesting" features, we integrated
AutoTreeS into the DNA Atlases, which plot structural measures for all positions
in a long DNA sequence (an entire chromosome) in the form of color-coded wheels
which combine evolutionary information from PyPhy and provide an excellent
genomic data mining tool. In completely sequenced genomes, the order and
position of individual genes are known, which facilitates the drawing of
phylomes. Sequences from EST projects are most of the time of unknown position
relative to each other. In order to draw phylomes we developed an additional
array-structure based phylome visualization.
Splice Site Prediction by Using Neural Networks, Revisited Topic
Yuan Tian, Naira Hovakimyan and Mark Borodovsky
School of Biology, Georgia Institute of Technology, Atlanta, Georgia, 30332-0230
USA
Artificial neural networks have been applied to the prediction of the splice
sites of experimentally validated genes. From chromosome I of Caenorhabditis
elegans, sequences of 100 nt length have been collected from genes, with AGs (for
acceptor sites) and GTs (for donor sites) in the middle positions. These sequences
have been split into three sets, one being used for NN training and
the other two for NN validation. We used SNNS, the Stuttgart Neural
Network Simulator [1], to build neural networks with a single hidden layer.
The number of hidden layer neurons ranged from 3 to 15. Our experiments have shown
that 5 neurons in the hidden layer give the best result. With our method of
training we were able to detect in the test sets 88% of the acceptor sites with
0.023% false positive predictions. When different lengths of sliding window
were tested, the 100 nt window gave the best prediction accuracy for acceptor sites.
Predictions have also been made for donor sites.
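Collecting candidate acceptor windows can be sketched as follows. This is a minimal illustration, assuming an even window length with the AG dinucleotide occupying the two central positions and a four-bit one-hot encoding per base; the abstract does not describe the input encoding used with SNNS, so the encoding, the function names and the short example sequence are assumptions.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def acceptor_windows(sequence, window=100):
    """Yield (position, window) for every AG lying in the middle of a full window.

    `position` is the 0-based index of the A; `window` is assumed to be even.
    """
    half = window // 2
    seq = sequence.upper()
    for i in range(half - 1, len(seq) - half):
        if seq[i:i + 2] == "AG":
            yield i, seq[i - half + 1:i + half + 1]


def one_hot(window):
    """Flatten a window into a 0/1 vector; unknown bases become all zeros."""
    vector = []
    for base in window:
        vector.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return vector


# Invented sequence and a 6-nt window so the result is easy to inspect.
toy_sequence = "CCTTAGGTCCAGAA"
toy_windows = list(acceptor_windows(toy_sequence, window=6))
toy_inputs = [one_hot(w) for _, w in toy_windows]
```

With window=100 each candidate yields a 400-element input vector under this encoding. Donor candidates could be collected in the same way by matching GT instead of AG.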
The results were compared with those reported in the literature. Neural networks with
a 61 nt long input window and 15 neurons in the hidden layer were applied to
Arabidopsis thaliana DNA [2]. Information from the global coding/non-coding
network was also used. That network could detect 80% of the acceptor sites with
a 0.034% false positive rate. Neural networks with a 41 nt long input window and 20
neurons in the hidden layer were applied to human DNA [3]. In combination with
the global coding/non-coding network, 90% of the true acceptor sites were
predicted with a 0.162% false positive rate. Our predictions for validated genes
of C. elegans have been compared with the performance of Netgene2, originally
described in [3]. For this species Netgene2 detected 70% of acceptor sites. In
the current paper we will present the results of splice site prediction for
human and Arabidopsis thaliana genomic sequences as well.
References
1. www.informatik.uni-stuttgart.de/ipvr/bv/projekte/snns/snns.html
2. S.M. Hebsgaard, P.G. Korning, N. Tolstrup, J. Engelbrecht, P. Rouze,
S. Brunak (1996), "Splice site prediction in Arabidopsis thaliana pre mRNA by combining
local and global sequence information.",
Nucleic Acids Res 24(17):3439-52
3. S. Brunak, J. Engelbrecht, S. Knudsen (1991), "Prediction of Human mRNA Donor and
Acceptor Sites from the DNA Sequence.",
J Mol Biol 220:49-65
Determining the Minimum Number of Types Necessary to Represent the Sizes of
Protein Atoms
Jerry Tsai, Neil Voss, and Mark Gerstein
Department of Biochemistry & Biophysics, 103 Biochemistry/ Biophysics Building,
Texas A&M University, 2128 College Station, Texas 77843-2128 USA and Department
of Molecular Biophysics and Biochemistry Yale University, Bass Center, 266
Whitney Avenue
P.O. Box 208114, New Haven, CT 06520-8114 USA
Traditionally, for packing calculations people have collected atoms together
into a number of distinct "types". These types, in fact, often represent a heavy
atom and its associated hydrogens (i.e. a united atom model) since hydrogens are
not usually resolved in protein crystal structures. Also, atom typing is
traditionally done strictly according to basic chemistry. This usually gives
rise to 20 to 30 types of atoms in proteins -- such as carbonyl carbons,
carbonyl oxygens, methyl groups, and hydroxyl groups. No one has yet
investigated how similar in packing these chemically derived types are. Here we
address this question in detail, using Voronoi volume calculations on a set of
high-resolution crystal structures. We perform a rigorous clustering analysis
with cross-validation on tens of thousands of atom volumes and attempt to
compile them into types based purely on packing criteria. From this analysis, we
are able to determine a "minimal" set of 18 atom types that most efficiently
represent the spectrum of packing in proteins. Our analysis highlights a number
of inconsistencies in traditional chemical typing schemes. Some united atoms
exhibit unintuitive packing volumes. In particular, tetrahedral carbons with two
hydrogens are almost identical in size to many aromatic carbons with a single
hydrogen, which are thought to be smaller in size. Our programs are available from
bioinfo.mbb.yale.edu/geometry and molmovdb.org.
Reannotation of the E. coli K12 Genome
Vera van Noort(1,2), Marie Skovgaard(1), Thomas Schou Larsen(1) and David Ussery(1)
(1)Centre for Biological Sequence Analysis, The Technical University of Denmark,
DENMARK and (2)Theoretical Biology / Bioinformatics, Utrecht University, The
NETHERLANDS
E. coli K12 MG1655 was sequenced in 1997. The genes that were annotated had
either experimental evidence or were predicted using codon usage statistics. In
1998 the annotation was updated by using the GeneMark program for prediction of
genes. A recent study showed that the number of annotated protein coding genes
in E. coli K12 MG1655 is about 15 percent higher than the number of expected
genes, calculated based on stop codon frequency and matches of long Open Reading
Frames (ORFs) to SwissProt. This 15 percent consists of ORFs that occur in the
genome by chance, but are not real genes. In biology and bioinformatics the
annotation of genomes is used for a number of purposes, for example the choice
of probes in microarray experiments, whole genome analysis, and the inclusion of
hypothetical proteins in protein databases like SwissProt, upon which a lot of
analyses are based. Thus an accurate annotation is necessary. The annotation
that we have made represents a more reliable set of genes than the current
annotation. Furthermore, we have given a measure of reliability to all GenBank
annotated genes and genes that were annotated by us. We have done this, firstly,
by finding E. coli proteins with experimental evidence in SwissProt and mapping
them to the genome.
Secondly, gene finding was done using Profinder. Profinder is an HMM based
gene finder, which is trained on high quality training sets of gene-containing
sequences constructed from extensions of ORF homology hits in the SwissProt
database. Null states are estimated from the shadows of these high-confidence
genes. Using posterior log-odds decoding, DNA sequences may then be scored for
gene content using the trained HMM. Only high scoring genes were included in the
reliable gene set.
Thirdly, homology searches were done against non-E. coli genes in SwissProt and against
translated ORFs from fully sequenced genomes. Again, only high scoring genes were
included as reliable.
These three methods led to a set of reliable genes that were visually inspected
using the Artemis program developed by the Sanger Centre. This made it clear
that most unreliable genes were short ORFs lying on the opposite strands of
genes in the direct environment. Such ORFs are also questionable because
prokaryotes tend to organize their genes in operons.
Apart from a measure of reliability, we also included the information of whether
a gene was found in a transcript or not during experiments. Using Affymetrix
microarray technology, binding levels of probes to mRNA were measured. It is
known that probes can have different binding affinities, thereby displaying up to
a 50 fold difference in mRNA level for the same gene. We modelled these binding
affinities based on the sequence of the probes and their deviation from the gene
level of the gene they are part of, using neural networks. We corrected the probe
levels for calculated binding affinities, thereby obtaining probes displaying gene
levels that can be compared. Using these corrected probe levels, gene levels
were calculated and 'low', 'medium' or 'highly expressed' was added as a label
to our annotated genes.
As it is impossible for people other than the submitters of a genome to suggest
changes in GenBank entries, our annotation will be available on a web server at
www.cbs.dtu.dk.
Identifying Number of Clusters in Gene Expression Data
Dali Wang, Habtom Ressom, Mohamad T. Musavi, and Cristian Domnisoru
University of Maine, Department of Electrical & Computer Engineering,
Intelligent Systems Laboratory, 201 Barrows Hall, Orono, ME 04469, USA
Motivation: Clustering is a very useful and important technique for analyzing
gene expression data. The Self-Organizing Map (SOM) is one of the most useful
clustering algorithms and has been used to cluster gene expression data.
SOM algorithms require the number of clusters as one of the initialization
parameters before clustering. However, the number of clusters in a gene
expression data set is generally not known in advance. The method that is currently being
used is to validate the results from SOM to find the best number. This approach
is very inconvenient and time-consuming.
This paper applies a novel SOM model, called Double SOM (DSOM), to cluster gene
expression data sets. DSOM overcomes this limitation by showing clearly and
visually how many clusters are appropriate. To validate the results, we also use
a novel validation technique known as the figure of merit (FOM).
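A figure of merit for expression clustering is commonly computed by leaving out
one condition, clustering the genes on the remaining conditions, and measuring
how tightly the left-out values group within each cluster; lower values indicate
better predictive power. The sketch below assumes that definition. The
clustering step is passed in as a function, and toy_cluster is a hypothetical
stand-in used only to make the example runnable; it is not SOM or DSOM.

```python
from math import sqrt


def fom_for_condition(data, left_out, cluster_fn, k):
    """Root-mean-square deviation of the left-out condition within clusters.

    data: list of gene rows (one value per condition).
    cluster_fn(rows, k) must return one cluster label per gene.
    """
    reduced = [[v for j, v in enumerate(row) if j != left_out] for row in data]
    labels = cluster_fn(reduced, k)
    total = 0.0
    for cluster in set(labels):
        held = [data[i][left_out] for i, lab in enumerate(labels) if lab == cluster]
        mean = sum(held) / len(held)
        total += sum((v - mean) ** 2 for v in held)
    return sqrt(total / len(data))


def aggregate_fom(data, cluster_fn, k):
    """Sum of the per-condition figures of merit over all conditions."""
    return sum(fom_for_condition(data, e, cluster_fn, k)
               for e in range(len(data[0])))


def toy_cluster(rows, k):
    """Hypothetical stand-in clusterer: rank genes by row sum, cut into k groups."""
    order = sorted(range(len(rows)), key=lambda i: sum(rows[i]))
    labels = [0] * len(rows)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(rows)
    return labels


# Artificial data with two obvious groups of genes.
data = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0],
    [5.0, 5.0, 5.0],
]
print(aggregate_fom(data, toy_cluster, 1))  # 6.0
print(aggregate_fom(data, toy_cluster, 2))  # 0.0
```

On this artificial data the aggregate figure of merit drops to zero at two
clusters, which matches the number of groups built into the data.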
Results: We use DSOM to cluster an artificial data set and two kinds of real
gene expression data sets. Our results show that DSOM not only clusters the
whole data set but also indicates the best number of clusters quickly and
clearly.
Availability: All materials related to this paper are available upon request
from the authors.
Contact: dwang@eece.maine.edu
Genome Trees Constructed Using Five Different Approaches Suggest New Major
Bacterial Clades
Yuri I. Wolf, Igor B. Rogozin, Nick V. Grishin, Roman L. Tatusov, Eugene V.
Koonin
National Center for Biotechnology Information, National Library of Medicine,
National Institutes of Health, Bethesda, MD 20894 USA
The availability of multiple complete genome sequences from diverse taxa prompts
the development of new phylogenetic approaches, which attempt to incorporate
information derived from comparative analysis of complete gene sets or large
subsets thereof. Such attempts are particularly relevant because of the major
role of horizontal gene transfer and lineage-specific gene loss, at least in the
evolution of prokaryotes.
Five largely independent approaches were employed to construct trees for
completely sequenced bacterial and archaeal genomes: i) presence-absence of
genomes in clusters of orthologous genes; ii) conservation of local gene order
(gene pairs) among prokaryotic genomes; iii) parameters of identity distribution
for probable orthologs; iv) analysis of concatenated alignments of ribosomal
proteins; v) comparison of trees constructed for multiple protein families. All
constructed trees support the separation of the two primary prokaryotic domains,
bacteria and archaea, as well as some terminal bifurcations within the bacterial
and archaeal domains. Beyond these obvious groupings, the trees made with
different methods appeared to differ substantially in terms of the relative
contributions of phylogenetic relationships and similarities in gene repertoires
caused by similar life styles and horizontal gene transfer to the tree topology.
The trees based on presence-absence of genomes in orthologous clusters and the
trees based on conserved gene pairs appear to be strongly affected by gene loss
and horizontal gene transfer. The trees based on identity distributions for
orthologs and particularly the tree made of concatenated ribosomal protein
sequences seemed to carry a stronger phylogenetic signal. The latter tree
supported three potential high-level bacterial clades: i)
Chlamydia-Spirochetes, ii) Thermotogales-Aquificales (bacterial
hyperthermophiles), and iii) Actinomycetes-Deinococcales-Cyanobacteria. The
last group also appeared to join the low-GC Gram-positive bacteria at a deeper
tree node. These new groupings of bacteria were supported by the analysis of
alternative topologies in the concatenated ribosomal protein tree using the
Kishino-Hasegawa test and by a census of the topologies of 132 individual groups
of orthologous proteins. Additionally, the results of this analysis put into
question the sister-group relationship between the two major archaeal groups,
Euryarchaeota and Crenarchaeota, and suggest instead that Euryarchaeota might be
a paraphyletic group with respect to Crenarchaeota.
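Approach i) starts from the pattern of presence and absence of each genome in
clusters of orthologous genes. One simple way to turn such patterns into
pairwise genome distances, from which a tree could then be built, is sketched
below. The Jaccard distance is an assumption chosen for illustration; the
abstract does not say which distance measure was used, and the genome names and
cluster identifiers are made up.

```python
def jaccard_distance(a, b):
    """1 - |shared clusters| / |clusters present in either genome|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(profiles):
    """Pairwise distances between genomes, keyed by (genome, genome)."""
    names = sorted(profiles)
    return {(x, y): jaccard_distance(profiles[x], profiles[y])
            for x in names for y in names}


# Hypothetical presence profiles: the set of orthologous clusters in which
# each genome is represented.
profiles = {
    "genomeA": {"c1", "c2", "c3", "c4"},
    "genomeB": {"c1", "c2", "c3"},
    "genomeC": {"c4", "c5"},
}
d = distance_matrix(profiles)
print(d[("genomeA", "genomeB")])  # 0.25
print(d[("genomeB", "genomeC")])  # 1.0
```

Because this distance reflects only shared gene content, two genomes that lost
or acquired the same genes look close regardless of ancestry, which is in line
with the sensitivity to gene loss and horizontal transfer reported above.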
We conclude that, the extensive horizontal gene flow and lineage-specific gene
loss notwithstanding, extension of phylogenetic analysis to the genome scale has
the potential of uncovering deep evolutionary relationships between prokaryotic
lineages.
Expression Profiler: Software to Analyze and Visualize Gene Expression Profiles
Tao Wu and Eileen Kraemer
Computer Science Department, The University of Georgia, Athens, GA 30602 USA
Gene expression profiling has become an important method in genomic research.
Current software systems for visualizing and analyzing large amounts of
expression profiling data suffer from insufficient flexibility in zooming and
manipulating graphical representations of the expression data. This limits the
degree of detail at which a user is able to explore the expression data and
examine the results of numerous analysis methods on these data.
We have developed the ExpressionProfiler, a software system, written in Java,
for visualizing and analyzing gene expression profiling data. The
ExpressionProfiler allows very flexible zooming on the graphical representation
of the expression data, and supports various operations for editing the data,
and interacting with their graphical representation.
In the ExpressionProfiler, we have implemented two different views of the
expression data and one clustering algorithm, the Unweighted Pair-Group Method
using Arithmetic Averages (UPGMA). However, the ExpressionProfiler has been built as an extensible
framework -- additional analysis algorithms and associated visualizations can be
added to the existing system easily and still enjoy the flexible zooming
capability the current system provides. Interactions with the current
visualizations include selection of subsets of genes and/or conditions, tree
restructuring, and reordering and regrouping of clusters. In addition, the user
is able to write out the resulting trees in standard formats, and to save or
print images of the trees and heat maps.
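For readers unfamiliar with UPGMA, the following minimal sketch shows the idea:
repeatedly merge the two clusters with the smallest average distance over all
pairs of their members. It is written in Python for brevity and is not the
ExpressionProfiler code, which is in Java; the gene names, profiles and the use
of Euclidean distance are assumptions made for the example.

```python
from itertools import combinations
from math import dist


def upgma(leaves, distance):
    """Build a UPGMA tree as nested tuples.

    leaves: list of leaf names; distance(a, b): distance between two leaves.
    """
    # Each cluster is a pair (subtree, list of member leaves).
    clusters = [(leaf, [leaf]) for leaf in leaves]

    def avg_dist(c1, c2):
        # Unweighted average over all leaf pairs across the two clusters.
        total = sum(distance(a, b) for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Made-up expression profiles over two conditions.
profiles = {
    "g1": [0.0, 0.0],
    "g2": [0.0, 1.0],
    "g3": [5.0, 5.0],
    "g4": [5.0, 6.0],
}
tree = upgma(list(profiles), lambda a, b: dist(profiles[a], profiles[b]))
print(tree)  # (('g1', 'g2'), ('g3', 'g4'))
```

This version recomputes averages from the leaves at every step, which keeps the
sketch short but is slower than updating a distance matrix after each merge.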
The ExpressionProfiler achieves all this with limited memory requirements: it
maintains a buffered image that covers only part of the entire graphical
representation of the data. In this way, the ExpressionProfiler creates an
impression of smooth scrolling as the user requests different parts of the
visualization, without excessive use of memory.
Images of the ExpressionProfiler visualizations, as well as class files,
instructions for installation and use, and sample input files may be found at:
jerry.cs.uga.edu/~twu
Analysis of Gene Expression Data by Ellipsoid ART and ARTMAP
Rui Xu, Donald C. Wunsch II
Applied Computational Intelligence Laboratory, Department of Electrical and
Computer Engineering, University of Missouri – Rolla, MO 65409-0249 USA
1. Purpose
Advances in DNA microarray techniques make it possible to measure the
expression levels of thousands of genes simultaneously under different
conditions or treatments. Extracting the biological information behind this
large amount of data has become a major challenge for computational biologists.
Many unsupervised clustering methods and supervised learning algorithms have
been successfully used in the field. In this study, we use a new family of
neural network architectures, Ellipsoid ART and ARTMAP (EA/EAM), to analyze the
AML/ALL data set and the human cancer cell line (NCI60) data set.
2. Method
EA/EAM builds on the ideas of Fuzzy ART and ARTMAP (FA/FAM). In this
architecture, hyper-ellipsoids rather than hyper-rectangles are used to
represent the shapes of the generated categories. EA/EAM keeps all the
properties of FA/FAM and may describe the data structure more efficiently.
3. Result
Two data sets are presented to EA/EAM. One is the leukemia data set, which
includes samples of two classes of leukemia (acute myeloid leukemia and
acute lymphoblastic leukemia). EA/EAM classifies all the training samples and
33 of the 34 test samples correctly. The single misclassified sample is treated
as an outlier by most other classifiers. The NCI60 data set consists of
expression profiles for 1376 genes in a set of 60 human cancer cell lines. Most
of the cell lines whose tissues have a common origin are clustered in the same
category.
4. Conclusion
The results show that EA/EAM is a very useful technique for analyzing
large-scale gene expression data, both for classification and for clustering.
References
T. Golub et al. "Molecular classification of cancer: Class discovery and class
prediction by gene expression monitoring." Science, 286: 531-537, 1999.
G. Anagnostopoulos, M. Georgiopoulos. "Ellipsoid ART and ARTMAP for incremental
clustering and classification." IJCNN01, pp. 1221-1226, 2001.
U. Scherf, D. T. Ross, et al. "A gene expression database for the molecular
pharmacology of cancer." Nature Genetics, 24(3): 236-244, 2000.
R. Tibshirani et al. "Clustering methods for the analysis of DNA microarray
data." Technical report, Department of Statistics, Stanford University.
DIGIT: A Novel Gene Finding Program by Combining Gene-Finders
Tetsushi Yada(1), Yasushi Totoki(2), Yoshio Takaeda(3), Yoshiyuki Sakaki(1),
Toshihisa Takagi(1)
(1)Human Genome Center, Institute of Medical Science, University of Tokyo, JAPAN;
(2)Genomic Sciences Center, RIKEN, JAPAN;
(3)Mitsubishi Research Institute, Inc., JAPAN
We have developed a general-purpose algorithm that finds genes by combining
multiple existing gene-finders. The algorithm has been implemented in a novel
gene-finder named DIGIT. An outline of the algorithm is as follows. First,
existing gene-finders are applied to an uncharacterized genomic sequence (input
sequence). Next, DIGIT produces all possible exons from the results of the
gene-finders, and assigns them their exon types, reading frames and exon scores.
Finally, DIGIT searches for a set of exons whose additive score is maximized
under their reading frame constraints. A Bayesian procedure and a hidden Markov
model are used to infer the exon scores and to search for the exon set,
respectively. We have designed DIGIT so as to combine FGENESH, GENSCAN and
HMMgene, and have assessed its prediction accuracy using recently compiled
benchmark data sets. For all data sets, DIGIT successfully discarded many false
positive exons predicted by the gene-finders and yielded remarkable improvements
in sensitivity and specificity at the gene level compared with the best gene
level accuracies achieved by any single gene-finder.
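The final step, choosing a set of exons with maximal additive score under
reading frame constraints, can be illustrated with a small dynamic program.
This is a simplified stand-in and not the hidden Markov model that DIGIT uses:
exon types are ignored, the scores are taken as given rather than inferred by a
Bayesian procedure, and the frame constraint is reduced to the assumption that
the phase at which one exon ends must equal the phase at which the next begins.
The Exon record and all values are hypothetical.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    start: int
    end: int
    phase_in: int    # codon phase at the first base (0, 1 or 2)
    phase_out: int   # codon phase expected at the start of the next exon
    score: float


def best_exon_chain(exons):
    """Highest-scoring chain of non-overlapping, phase-compatible exons."""
    exons = sorted(exons, key=lambda e: e.end)
    # best[i] = (score of the best chain ending at exon i, previous index)
    best = []
    for i, exon in enumerate(exons):
        score, prev = exon.score, None
        for j in range(i):
            p = exons[j]
            if p.end < exon.start and p.phase_out == exon.phase_in:
                candidate = best[j][0] + exon.score
                if candidate > score:
                    score, prev = candidate, j
        best.append((score, prev))
    if not best:
        return 0.0, []
    i = max(range(len(best)), key=lambda k: best[k][0])
    total = best[i][0]
    chain = []
    while i is not None:
        chain.append(exons[i])
        i = best[i][1]
    return total, chain[::-1]


# Candidate exons pooled from several gene-finders (made-up values).
candidates = [
    Exon(100, 200, 0, 2, 3.0),
    Exon(150, 260, 0, 1, 4.0),   # overlaps the first candidate
    Exon(300, 400, 2, 0, 2.5),
    Exon(300, 420, 0, 0, 5.0),   # high score but phase-incompatible upstream
    Exon(500, 600, 0, 0, 1.0),
]
total, chain = best_exon_chain(candidates)
print(total)                               # 6.5
print([(e.start, e.end) for e in chain])   # [(100, 200), (300, 400), (500, 600)]
```

In this toy case the candidate with the highest individual score is left out
because no phase-compatible chain through it beats the selected one.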
A Visualization System for Protein Interaction Mapping
Yong Zhang(1), Hui Tian(1), Jonathan Arnold(2), Eileen Kraemer(1)
(1)Computer Science Department and (2)Genetics Department, University of Georgia,
Athens, GA, USA
An exciting challenge in science today is to use sequenced genomes to predict
how living systems function and evolve. The goal is to develop a new systems
approach using sequenced genomes to identify the molecular machines underlying
fundamental processes like transcription, metabolism, development, biological
clocks, transvection, mating, aging, and pathogenicity. Protein-protein
interactions are crucial to understanding these biological processes, and thus
protein interaction mapping is an important element of this work. Visualization
can provide scientists with insight into the relationships among these proteins.
We have developed a tool designed to assist scientists in identifying clusters
in protein-interaction data, through visualization and interaction. To begin,
the user may select or provide a data set representing protein-protein
interactions. Input may consist of either a simple listing of names of
interacting pairs, as with the mapping data from Ito et al. (2000) on S.
cerevisiae, or may include a numerical value representing the strength of the
interaction. Users may then select from among several graph clustering, layout,
coloring, and shading algorithms, view a 3-dimensional display of the
protein-protein interaction map, and interact with this display to:
search for a particular protein in the graph
obtain additional information about nodes (proteins), edges (interactions), and strongly connected components (interaction clusters)
adjust the graph layout by moving or deleting nodes or clusters
select a node to serve as a "center", hide other nodes, and then interactively add nodes back in a step-by-step fashion
selectively color nodes to emphasize similarities or differences
apply graded coloration techniques to highlight the relative distance of various proteins from a selected node or cluster
modify the user's perspective on the graph through position and rotation control
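Graded coloration by distance from a selected node can be based on a
breadth-first search over the interaction graph. The sketch below is an
illustration in Python, not the tool's Java-3D implementation; the protein
names are placeholders, and treating interactions as undirected, unweighted
edges is an assumption.

```python
from collections import deque


def build_graph(pairs):
    """Undirected adjacency sets from a list of interacting pairs."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distances_from(graph, center):
    """Number of interaction steps from center to every reachable protein."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def shade(distance, max_distance):
    """Intensity in [0, 1]; 1.0 for the selected node, 0.0 for the farthest."""
    if max_distance == 0:
        return 1.0
    return 1.0 - distance / max_distance


# Placeholder interaction pairs.
pairs = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P5")]
graph = build_graph(pairs)
dist = distances_from(graph, "P1")
print(sorted(dist.items()))
# [('P1', 0), ('P2', 1), ('P3', 2), ('P4', 3), ('P5', 1)]
```

Proteins that cannot be reached from the selected node do not appear in the
result and could be drawn in a neutral color.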
The tool is implemented in Java-3D, which facilitates web-based interaction and
distribution of results. Several heuristic algorithms have been implemented
(simple, cluster, spring embedder, simulated annealing), and the results
compared, both for the usefulness of the resulting graph in emphasizing
clusters and for the time required to produce the graph. The executable version
of the program, instructions for download and use, and details of the comparison
are available for download at:
jerry.cs.uga.edu/~yozhang
References
Ito, T.; Tashiro, K.; Muta, S.; et al. (2000). "Toward a protein-protein
interaction map of the budding yeast: A comprehensive system to examine
two-hybrid interactions in all possible combinations between the yeast proteins."
Proc Natl Acad Sci USA 97(3):1143-1147.