Department of Biological Sciences, Clark Atlanta University, Atlanta, Georgia 30314
Identifying gene sequences is an important task, particularly given the large amount of unannotated DNA in sequence databases. We attempted to identify genes by designing a neural network (NN) that could recognize patterns in nucleic acid sequences.
We trained an NN on amino acid and nucleic acid sequences and then tested its ability to predict the correct codon given an amino acid sequence. Several network configurations were tested, varying the number of input neurons that represented the amino acid while keeping the representation of the nucleic acid constant. Each network was a multi-layer perceptron with a single hidden layer of 5 to 9 neurons. The best network correctly predicted 93% of all bases, 85% of the degenerate bases, and 100% of the fixed bases. The training set was composed of 60 human sequences containing 600 codons; every codon was represented except CAU (histidine). This genetic data analysis effort will assist in understanding human gene structure. Potential benefits include computational tools that back-translate amino acid sequences more reliably, which is useful for degenerate PCR cloning, and that may assist in identifying human gene coding sequences (CDS) among open reading frames in DNA databases.
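The distinction between fixed and degenerate bases can be illustrated with a short Python sketch. It is not the network or scoring code used in this work. It assumes the standard genetic code and assumes that a codon position counts as "fixed" when every synonymous codon for that amino acid carries the same base there, and as "degenerate" otherwise. The function names (`codons_for`, `fixed_positions`, `score`) are hypothetical.

```python
from itertools import product

# Standard genetic code in RNA letters, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(aa):
    """Return all codons that encode the one-letter amino acid `aa`."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]


def fixed_positions(aa):
    """For each of the 3 codon positions, True if all synonymous codons agree.

    Assumption: this is what 'fixed base' means; the abstract does not define it.
    """
    codons = codons_for(aa)
    return [len({codon[i] for codon in codons}) == 1 for i in range(3)]


def score(protein, true_codons, predicted_codons):
    """Fraction of correctly predicted bases: overall, fixed, and degenerate."""
    hits = {"overall": 0, "fixed": 0, "degenerate": 0}
    totals = dict.fromkeys(hits, 0)
    for aa, true, pred in zip(protein, true_codons, predicted_codons):
        for i, is_fixed in enumerate(fixed_positions(aa)):
            kind = "fixed" if is_fixed else "degenerate"
            correct = int(true[i] == pred[i])
            for key in ("overall", kind):
                totals[key] += 1
                hits[key] += correct
    # Skip categories with no bases to avoid division by zero.
    return {k: hits[k] / totals[k] for k in hits if totals[k]}
```

Under this definition, `fixed_positions("H")` marks the first two positions of histidine as fixed and the third as degenerate, because its two codons (CAU and CAC) differ only in the last base. A back-translation predictor can therefore get the fixed bases right from the amino acid alone, whereas the degenerate bases depend on learned codon usage.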
(This work is supported by NIGMS/MBRS #S06GM08247)