Protein Sequence Database Searches Using Compositionally Adjusted Amino Acid Substitution Matrices

Stephen Altschul
National Center for Biotechnology Information

Standard amino acid substitution matrices are constructed as log-odds ratios from large collections of alignments of related proteins. Any such collection has an implicit "standard" set of amino acid background frequencies. The resulting matrices, however, are often used to compare proteins with quite non-standard amino acid compositions. We have argued on theoretical grounds that this is inappropriate, and have described a method for transforming a standard matrix into one appropriate for comparing proteins with arbitrary non-standard compositions. Compositionally adjusted matrices yield improved results, in terms of both alignment score and alignment quality, when proteins with strongly biased compositions are compared.
To what extent are such adjusted matrices useful for general-purpose protein database searches? Using standard test platforms, we compared a standard matrix to compositionally adjusted matrices, with relative entropy either left unconstrained or constrained in various ways. We found that constraining the relative entropy of the compositionally adjusted matrix to a fixed value in the new compositional context generally produced the best results. We also found that even when the sequences compared are not known to have strong compositional biases, it is still, on average, advantageous to use an adjusted matrix when the sequences satisfy certain simple length or compositional inequalities. Applying these findings to general-purpose database searches can lead to a significant improvement in retrieval performance, with a minimal increase in execution time.
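
To make the two quantities mentioned above concrete, the following is a minimal sketch of how a log-odds score matrix and its relative entropy can be computed from a table of target pair frequencies. It is illustrative only: the three-letter alphabet, the frequency values in `Q`, and the function names are hypothetical assumptions, and the sketch does not implement the compositional adjustment method itself.

```python
import math

# Hypothetical target frequencies q[i][j] for a toy 3-letter alphabet.
# Real matrices use the 20 amino acids and frequencies estimated from alignments.
Q = [
    [0.30, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.05, 0.05, 0.20],
]

def background_frequencies(q):
    """Background frequencies implied by q: the row sums (marginals)."""
    return [sum(row) for row in q]

def log_odds_matrix(q):
    """Scores s[i][j] = log2(q[i][j] / (p[i] * p[j])), in bits."""
    p = background_frequencies(q)
    n = len(q)
    return [[math.log2(q[i][j] / (p[i] * p[j])) for j in range(n)]
            for i in range(n)]

def relative_entropy(q, s):
    """Relative entropy H = sum over i, j of q[i][j] * s[i][j], in bits."""
    n = len(q)
    return sum(q[i][j] * s[i][j] for i in range(n) for j in range(n))

P = background_frequencies(Q)
S = log_odds_matrix(Q)
H = relative_entropy(Q, S)
```

In this toy table, identical pairs occur more often than chance and receive positive scores, while mismatched pairs receive negative scores. The relative entropy `H` is the expected score per aligned position under the target frequencies, which is the quantity the constrained variants described above hold at a fixed value.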

Workshop IV: Search and Knowledge Building for Biological Datasets