Data Mining for Protein Structure Prediction

Mohammed Zaki (RPI) (S)

Proteins fold spontaneously and reproducibly into complex
three-dimensional globules when placed in an aqueous solution, and
the sequence of amino acids making up a protein appears to completely
determine its three-dimensional structure. This self-organization
cannot occur by a random conformational search for the lowest-energy
state, since such a search would take millions of years, whereas
proteins fold in milliseconds (this is known as Levinthal's paradox).
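
The scale of the paradox can be illustrated with a back-of-the-envelope
calculation. The sketch below is only an illustration: the chain length
(100 residues), the number of conformations per residue (3), and the
sampling rate (10^13 conformations per second) are assumed values chosen
for the example, not figures from the talk.

```python
# Rough Levinthal-style estimate. All numeric defaults are illustrative
# assumptions, not values taken from the talk.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(n_residues, states_per_residue=3,
                            samples_per_second=1e13):
    """Years needed to visit every conformation of the chain exactly once."""
    n_conformations = states_per_residue ** n_residues
    seconds = n_conformations / samples_per_second
    return seconds / SECONDS_PER_YEAR


years = exhaustive_search_years(100)
print(f"Exhaustive search would take about {years:.1e} years")
```

Under these assumptions the estimate is many orders of magnitude beyond
millions of years, which is why an unguided random search cannot explain
folding on a millisecond time scale.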

In this talk I'll highlight some of the data mining challenges for the
protein folding problem, i.e., how to predict the three-dimensional
tertiary structure of a protein given its linear amino acid sequence.
I'll discuss some recent work on a hybrid approach that first predicts
local structure using a Hidden Markov Model (HMM) and then infers contact
rules based on association mining. The HMM models the interactions
between adjacent short regions of the protein sequence, and so
attempts to model the propagation of structure along the sequence. To
detect long-range amino-acid contacts we discover rules to predict whether
a pair of residues is in contact or not. In the testing phase one can
predict the contact map for an unknown protein, and from the contact
map one can recover the 3D shape. I'll discuss limitations of the
current approach, and some future directions on how to incorporate
geometric constraints while mining and whether one can learn the
folding pathways.
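
To make the notions of a contact map and a contact rule concrete, here is
a minimal sketch, not the method presented in the talk. It assumes that
two residues are "in contact" when their coordinates lie within a distance
threshold, and it scores very simple rules of the form "residue types
(a, b) imply contact" by support and confidence. The function names, the
6.0 threshold, the minimum sequence separation of 3, and the toy
coordinates and sequence are all hypothetical choices made for
illustration.

```python
import math
from collections import Counter


def contact_map(coords, threshold=6.0, min_separation=3):
    """Symmetric 0/1 matrix: 1 if residues i and j are within `threshold`.

    Pairs closer than `min_separation` along the chain are left at 0,
    since the interest here is in long-range contacts (an assumption).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


def pair_rule_stats(sequence, cmap, min_separation=3):
    """Support and confidence of each rule '(type_a, type_b) -> contact'."""
    totals, hits = Counter(), Counter()
    n = len(sequence)
    for i in range(n):
        for j in range(i + min_separation, n):
            key = tuple(sorted((sequence[i], sequence[j])))
            totals[key] += 1          # pairs matching the rule's left side
            hits[key] += cmap[i][j]   # ... that are also in contact
    n_pairs = sum(totals.values())
    return {key: (hits[key] / n_pairs, hits[key] / totals[key])
            for key in totals}


# Toy hairpin: six residues laid out on two antiparallel strands.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
sequence = "LVGGVL"

cmap = contact_map(coords)
stats = pair_rule_stats(sequence, cmap)
for pair, (support, confidence) in sorted(stats.items()):
    print(pair, f"support={support:.2f}", f"confidence={confidence:.2f}")
```

In a real setting the rules would be mined from many proteins of known
structure and would likely use richer features than the two residue
types; the predicted contacts for a new sequence would then form the
contact map from which a 3D shape is recovered.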


Mohammed Zaki is currently an Assistant Professor in the Computer
Sciences Department at Rensselaer Polytechnic Institute. He received
his M.S. ('95) and Ph.D. ('98) degrees in computer science from the
University of Rochester. His research interests include the design of
efficient, scalable, and parallel algorithms for various data mining
techniques. He is especially interested in developing novel data mining
techniques for applications like bioinformatics, web mining, and
materials informatics. He recently received a CAREER Award from the
National Science Foundation for his research on "Application-Oriented
Large-Scale Parallel Data Mining."

Dr. Zaki has published over 50 papers on Data Mining. He has co-edited
4 books, including "Large-scale Parallel Data Mining," LNAI Vol. 1759,
Springer-Verlag, 2000. He was co-chair for the ACM SIGKDD Workshop on
Data Mining in Bioinformatics (BIOKDD01) and he is guest-editing a
special issue of Information Systems on Bioinformatics and Biological
Data Management; he is also a guest-editor for Distributed and
Parallel Databases special issue on Data Mining (2002), and for SIGKDD
Explorations special issue on Online, Interactive, Anytime Data
Mining. He has co-chaired several special-topics workshops and has
served as program committee member of many international conferences
on data mining. He is a member of the ACM (SIGKDD, SIGMOD) and the
IEEE (IEEE Computer Society).


Back to Mathematical Challenges in Scientific Data Mining