Swiss Institute for Experimental Cancer Research, 1066 Epalinges s/Lausanne, Switzerland
Generalized profiles are computer-searchable descriptions of sequence families, domains, motifs, and other elementary components of genetic information; they can also be interpreted as hidden Markov models (HMMs) of a particular architecture. Generalized profiles or HMMs are among the most effective tools for identifying and characterizing highly divergent protein homology domains, as judged by the following criteria: (i) discrimination of true members from chance matches, (ii) accurate definition of domain boundaries, and (iii) correctness of profile-generated multiple sequence alignments. Their excellent performance on these criteria makes comprehensive collections of profiles or HMMs extremely useful for automatic sequence annotation. This will be illustrated by a whole-genome application using the PROSITE profile and PFAM HMM libraries. The talk will also address a number of important technical issues related to the application of generalized profiles. Several protocols for deriving profiles from initial data will be described, along with a comparative evaluation of their performance with respect to the above criteria. In addition, new solutions to the problem of estimating the statistical significance of profile matches will be presented. These take into account various non-random properties of biological sequence sets (e.g. compositional bias, periodicities, and subfamily over-representation) that are known to invalidate statistical tests based on simple random models.