Proteins are central components of cell machinery and life.
Several researchers have pointed out that to fully understand cell machinery, studying proteins in isolation is not enough; (clusters of) interactions need to be delineated as well, since it is strongly believed that proteins work with other proteins to regulate and support each other for specific functions.
Recent advances in technology have enabled scientists to determine, identify and validate pair-wise protein interactions through a range of experimental and in-silico methods.
Such data can be naturally represented in the form of (multiple) interaction networks.
The task of extracting relevant groupings or functional modules from such interaction networks, for the purposes of understanding the behavior of organisms, protein function prediction and drug design, is challenging and an active area of research.
The challenges are daunting. First is the issue of data integration and data quality. Different experimental and in-silico methods can be used to compute interactions, each with its own strengths and weaknesses. Often, the overlap, in terms of common interactions across experimental settings, is not very high. An added complexity is that the data obtained from such methods is believed to be quite noisy: many interactions obtained even by a single methodology are conjectured to be false positives.
Second, even if the network is assumed to be noise free, partitioning the network using classical graph partitioning or clustering schemes is inherently difficult. A common characteristic of Protein-Protein Interaction (PPI) networks is that a few nodes (hubs) have very large degrees, while most other nodes have very few interactions. Applying traditional clustering approaches typically results in a clustering arrangement that is quite poor, containing one or a few giant core clusters and several tiny clusters (possibly singleton clusters).
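To make the hub effect concrete, the following is a minimal sketch that counts node degrees in a toy interaction network and flags hubs. The edge list, the protein names, the function names and the cutoff of twice the mean degree are all illustrative assumptions, not data or parameters from this work.

```python
from collections import Counter

# Hypothetical toy edge list; protein names are illustrative only.
edges = [
    ("HUB", "P1"), ("HUB", "P2"), ("HUB", "P3"), ("HUB", "P4"),
    ("HUB", "P5"), ("HUB", "P6"), ("P1", "P2"), ("P7", "P8"),
]


def degree_counts(edge_list):
    """Return a Counter mapping each node to its number of interactions."""
    degrees = Counter()
    for u, v in edge_list:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


def find_hubs(degrees, factor=2.0):
    """Flag nodes whose degree exceeds `factor` times the mean degree.

    The cutoff is an arbitrary assumption chosen for illustration.
    """
    mean_degree = sum(degrees.values()) / len(degrees)
    return sorted(node for node, d in degrees.items() if d > factor * mean_degree)


degrees = degree_counts(edges)
hubs = find_hubs(degrees)
print(degrees.most_common(3))  # the highest-degree nodes
print(hubs)                    # nodes flagged as hubs under the assumed cutoff
```

In this toy network a single node carries most of the edges, which is the kind of skew that tends to pull a classical partitioning toward one giant cluster plus a few tiny ones.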
Third, some proteins are believed to be multi-functional, and effective strategies for soft clustering of these essential proteins are needed. This dictates the need to leverage or adapt soft clustering approaches.
In this talk, we make the case for an ensemble clustering framework to address these problems. For base clustering, we introduce two topology-based distance metrics to counteract the effects of noise. We develop a PCA-based consensus clustering technique, designed to reduce the dimensionality of the consensus problem and yield informative clusters. We also develop a soft consensus clustering variant to assign multifaceted proteins to multiple functional groups. We conduct an empirical evaluation of different consensus techniques using topology-based, information-theoretic and domain-specific validation metrics and show that our approaches can provide significant benefits over other state-of-the-art approaches. Our analysis of the consensus clusters obtained demonstrates that ensemble clustering can (a) produce improved, biologically significant functional groupings; and (b) facilitate soft clustering by discovering multiple functional associations for proteins.
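The general idea of combining several base clusterings into a consensus can be sketched as follows. This is not the PCA-based technique described above; it substitutes a simpler, well-known co-association scheme (count how often two proteins are clustered together, then link pairs that co-occur in a majority of base clusterings). The function names, the 0.5 threshold and the toy label vectors are assumptions made for illustration.

```python
from itertools import combinations


def co_association(clusterings):
    """Fraction of base clusterings that place each pair of items together."""
    n = len(clusterings[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in clusterings:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(clusterings) for pair, c in counts.items()}


def consensus_clusters(clusterings, threshold=0.5):
    """Group items whose co-association exceeds `threshold`.

    Groups are the connected components of the thresholded co-association
    graph, found here with a small union-find structure.
    """
    n = len(clusterings[0])
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), score in co_association(clusterings).items():
        if score > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for item in range(n):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


# Three hypothetical base clusterings of six proteins (indices 0..5).
base = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 0],
]
result = consensus_clusters(base)
print(result)
```

Cluster labels need not agree across base clusterings, since only co-membership is counted. A soft variant could, for instance, keep the fractional co-association scores rather than thresholding them, though that is a design choice outside what this sketch shows.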
Back to Workshop IV: Search and Knowledge Building for Biological Datasets
This is joint work with my graduate students Sitaram Asur and Duygu Ucar.