Commonness, Complexity, Flavors and Function of Intrinsic Protein Disorder: A Bioinformatic Study

Zoran Obradovic (Temple University) (I)

Intrinsic protein disorder refers to segments or whole proteins that fail to fold into a fixed 3D structure on their own. Contrary to the {Sequence} => {3D Structure} => {Function} paradigm, there are examples of proteins with long intrinsically disordered regions that nonetheless carry out function. To realize the potential of the human genome project, it is essential to determine the commonness and types of intrinsic protein disorder and to determine the set of functions carried out by such proteins. Toward this objective, we assembled a database of known disordered protein sequence segments and used it to develop predictors of protein disorder from primary sequence information. In addition to designing global classifiers trained on all disorder data, we also used disjoint data subsets to develop specialized disorder predictors. These partitions were initially defined using domain-specific knowledge, but we also employed a novel incremental competitive machine learning algorithm that automatically partitions a set of available disordered proteins into subsets with similar properties. In the talk, we will describe the data mining and machine learning procedures used in the study and will report results obtained by analyzing sequences from the Protein Data Bank, the Swiss-Prot database and 28 complete genomes. The results provide strong evidence that:
(1) disorder is a very common element of protein structure;
(2) strength of disorder predictions is correlated with sequence complexity;
(3) eucaryotes may have a higher proportion of intrinsic protein disorder than eubacteria or archaebacteria; and
(4) at least three different types of protein disorder exist in nature.
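
As an illustration of what "sequence complexity" in point (2) can mean, the sketch below computes the Shannon entropy of the amino acid composition in a sliding window along a sequence. This is only one common way to quantify complexity; the abstract does not state which measure the study used, and the function names, the window length of 21 and the example sequences are all assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def windowed_complexity(sequence, window=21):
    """Entropy of every full-length window along the sequence.

    The window length is a hypothetical default, not a value from the study.
    Returns an empty list if the sequence is shorter than the window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        shannon_entropy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]


if __name__ == "__main__":
    # Made-up sequences: a repetitive (low-complexity) stretch and a mixed one.
    low = "SSSSSSGSSSSSSGSSSSSSGSSSS"
    mixed = "MKVLAAGIVGLCTHEWQRPDNFYSK"
    print(min(windowed_complexity(low)), max(windowed_complexity(mixed)))
```

A homopolymer window scores 0 bits, and a window in which all 20 standard amino acids are equally frequent scores log2(20), about 4.32 bits. A per-window profile like this could then be compared against per-residue disorder prediction scores.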

The reported results were obtained through a collaboration with C.J. Brown, A.K. Dunker, P. Romero and S. Vucetic funded by NSF-CSE-II-9711532, NSF-IIS-0196237 and NIH-R01-LM06916 research grants. More details on this project can be found at www.ist.temple.edu/~zoran/bioinformatics.html

Bio:
Zoran Obradovic is the Director of the Center for Information Science and Technology and a Professor of Computer and Information Sciences at Temple University. His research focuses on solving challenging problems in Bioinformatics, Geostatistics and Computational Finance by developing and integrating data mining and statistical learning technology for efficient knowledge discovery in large databases. Funded by NSF, NIH, DOE and industry, he has contributed over the last decade to about 120 refereed articles on these and related topics and to several academic and commercial software systems.


