National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD 20894, USA
To extract the maximum amount of information from the rapidly accumulating genome sequences, all conserved genes need to be classified according to their homologous relationships. By comparing proteins encoded in seven complete genomes (five bacterial, one archaeal, and one eukaryotic) from five major phylogenetic lineages and elucidating consistent patterns of sequence similarities, we delineated 720 Clusters of Orthologous Groups (COGs). Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG. This automatically yields a number of functional predictions for poorly characterized genomes. The COGs comprise a framework for functional and evolutionary genome analysis. The COG system was used to analyze three additional genomes, those of the giant symbiotic plasmid from Rhizobium sp., the pathogenic bacterium Helicobacter pylori, and the nematode Caenorhabditis elegans (~60% of the genome). The COG system allowed semi-automatic functional annotation of the conserved portion of each gene set and identification of common and rare phylogenetic patterns, which differ significantly between bacteria and eukaryotes. A systematic survey of conserved families missing in H. pylori suggests major revisions of the central metabolic pathways in this bacterium.
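The requirement that each COG include proteins from at least three lineages with consistent similarity patterns can be illustrated with a minimal sketch. The abstract does not specify the clustering procedure, so the criterion used here is an assumption: three proteins from three different lineages form a "triangle" when every pair is a reciprocal best hit, and triangles sharing a side are merged. The protein identifiers, lineage labels, and the `best_hit` table are hypothetical toy data, not results from the study.

```python
from itertools import combinations, permutations


def find_triangles(lineage_of, best_hit):
    """Return protein triples from three distinct lineages in which
    every pair is a reciprocal best hit (an assumed consistency rule)."""
    def reciprocal(p, q):
        return (best_hit.get((p, lineage_of[q])) == q
                and best_hit.get((q, lineage_of[p])) == p)

    triangles = []
    for a, b, c in combinations(sorted(lineage_of), 3):
        if len({lineage_of[a], lineage_of[b], lineage_of[c]}) < 3:
            continue  # need three different lineages
        if reciprocal(a, b) and reciprocal(b, c) and reciprocal(a, c):
            triangles.append(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) into clusters."""
    clusters = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if len(clusters[i] & clusters[j]) >= 2:
                clusters[i] |= clusters[j]
                del clusters[j]
                merged = True
                break  # restart the scan after every merge
    return clusters


# Hypothetical toy data: one family present in five lineages, plus a
# pair of proteins found in only two lineages.
lineage_of = {
    "a1": "lineage_A", "b1": "lineage_B", "c1": "lineage_C",
    "d1": "lineage_D", "e1": "lineage_E",
    "a2": "lineage_A", "e2": "lineage_E",
}
family = ["a1", "b1", "c1", "d1", "e1"]

# best_hit[(query, target_lineage)] = best-scoring protein in that lineage
best_hit = {(p, lineage_of[q]): q for p, q in permutations(family, 2)}
best_hit[("a2", "lineage_E")] = "e2"
best_hit[("e2", "lineage_A")] = "a2"

triangles = find_triangles(lineage_of, best_hit)
clusters = merge_triangles(triangles)
print(clusters)
```

In this toy input the five-lineage family collapses into a single cluster, while the pair `a2`/`e2` is excluded because it never spans three lineages. Requiring strict reciprocity is a simplification that would need to be relaxed to accommodate orthologous sets of paralogs.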