(1) Center for Cellular and Molecular Biology, Hyderabad 500 007, India
(2) NIAID, National Institutes of Health, Bethesda, MD, USA
(3) School of Environmental Sciences,
(4) School of Life Sciences,
(5) School of Physical Sciences, Jawaharlal Nehru University, New Delhi 110 067, India.
We have developed a technique to locate genes or exons in silico using a Fourier-based measure that quantifies the three-base periodicity characteristic of coding regions. Genomic sequences are converted into digital signals, which are then subjected to Fourier transformation. The 3-base correlation appears as a peak in the power spectrum at frequency f=1/3. From an extensive study of several thousand genes from a variety of organisms (ranging from prokaryotes, including the archaebacterium Methanococcus jannaschii, to higher eukaryotes such as human), we have found that the signal-to-noise ratio of this peak, P_N (where N is the length of the sequence), is a good indicator of coding potential. A value of P_N above a specified threshold (our studies indicate that this value is around four) implies a coding region, and a value below the threshold indicates a non-coding region.
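The measure described above can be illustrated with a minimal sketch. The abstract does not say how sequences are digitized or how the noise level is defined, so the choices here are assumptions: each base gets a 0/1 indicator sequence, the power at f=1/3 is summed over the four bases, and "noise" is taken as the mean power over all nonzero frequencies. The function names `power_at` and `signal_to_noise` are hypothetical.

```python
import cmath

BASES = "ACGT"


def power_at(seq: str, k: int) -> float:
    """Power at DFT index k, summed over the four 0/1 indicator sequences."""
    n = len(seq)
    total = 0.0
    for base in BASES:
        # DFT coefficient of the indicator sequence for this base.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k * i / n)
            for i, s in enumerate(seq)
            if s == base
        )
        total += abs(coeff) ** 2
    return total


def signal_to_noise(seq: str) -> float:
    """Power at f = 1/3 divided by the mean power over nonzero frequencies.

    Assumed definition for illustration; not taken from the abstract.
    """
    seq = seq.upper()
    n = len(seq) - len(seq) % 3  # trim so that f = 1/3 is an exact DFT index
    if n < 3:
        raise ValueError("need at least three bases")
    seq = seq[:n]
    peak = power_at(seq, n // 3)
    # Parseval: for a 0/1 sequence with c ones, the power summed over all n
    # frequencies is n * c, and the zero-frequency term is c ** 2.
    counts = [seq.count(b) for b in BASES]
    noise_total = sum(n * c - c * c for c in counts)
    if noise_total == 0:
        return 0.0  # homopolymer, or no recognised bases
    return peak / (noise_total / (n - 1))
```

Under these assumptions, a perfectly periodic toy input such as `"ATG" * 30` scores far above the threshold of four, while `"ACGT" * 30`, whose period is 4 rather than 3, scores essentially zero.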
Our algorithm, termed GeneScan, uses this observation to analyze a given genomic sequence by measuring the local value of P_M in a window of length M, and thereby to identify protein-coding regions. The algorithm has been tested on genomic sequences from Saccharomyces cerevisiae, Haemophilus influenzae, Plasmodium falciparum, M. jannaschii, Helicobacter pylori and a number of other organisms. Its performance is comparable to that of existing techniques, although a fraction of genes in every organism lack the 3-base periodicity and are thus invisible to our method of identification. Such invisible genes have been analyzed and are shown to have unusual codon usage and bias. The methodology will detect pseudogenes, but other than these, few false positive assignments are made.
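The windowed scan described above might look like the following sketch. It is not the GeneScan implementation: the window length (351), the step (3), the merging of overlapping windows, and the scoring definition (0/1 indicator sequences, power at f=1/3 over mean power at nonzero frequencies) are all assumptions, and the names `periodicity_score` and `scan` are hypothetical. Only the threshold of about four comes from the text.

```python
import cmath
from typing import List, Tuple

# Phase factors exp(-2*pi*i*j/3) for j = 0, 1, 2.
_W = cmath.exp(-2j * cmath.pi / 3)
_PHASES = (1 + 0j, _W, _W * _W)


def periodicity_score(window: str) -> float:
    """Assumed local measure: power at f = 1/3 over mean nonzero-frequency power."""
    n = len(window)
    peak = 0.0
    noise_total = 0
    for base in "ACGT":
        positions = [i for i, s in enumerate(window) if s == base]
        # At f = 1/3 the Fourier phase depends only on the position modulo 3.
        coeff = sum(_PHASES[i % 3] for i in positions)
        peak += abs(coeff) ** 2
        c = len(positions)
        noise_total += n * c - c * c  # Parseval total minus the f = 0 term
    if noise_total == 0:
        return 0.0
    return peak * (n - 1) / noise_total


def scan(seq: str, window: int = 351, step: int = 3,
         threshold: float = 4.0) -> List[Tuple[int, int]]:
    """Return merged half-open (start, end) spans whose windows exceed threshold."""
    if window < 3 or window % 3 != 0:
        raise ValueError("window must be a positive multiple of 3")
    seq = seq.upper()
    regions: List[Tuple[int, int]] = []
    for start in range(0, len(seq) - window + 1, step):
        if periodicity_score(seq[start:start + window]) > threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

Because each window is scored independently, the reported boundaries are only as sharp as the window length allows; a shorter window localizes better but gives a noisier score.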
The advantage of such a measure is that it is relatively insensitive to frameshift errors and is universally applicable, since it is not organism-specific. Furthermore, the methodology appears to be insensitive to G+C content, and can thus be profitably used as an additional measure to detect genes in novel organisms.