A scale dependent clustering model by direct maximization of homogenity and separation

Tony Chan (UCLA), Tarek Mathew and Andy Yip (UCLA) (C)

A scale dependent clustering model by direct maximization of homogeneity and separation
Tony Chan
UCLA Tarek P. Mathew Andy M. Yip

We present a clustering model for grouping a set of high dimensional data into subsets of homogenous clusters which are well-separated from each other. A novel feature of our model is that it allows the user to directly control the scale of the clusters. This is realized by formulating the clustering problem as an optimization problem whose objective function combines two measures of the quality of a clustering, namely homogeneity and separation, and a parameter which controls the scale. We also present a practical heuristic algorithm to solve the optimization problem. For a particular case when we cluster the data based on pairwise similarity measured by Pearson correlation coefficeints, we derived fast methods for evaluation and updating of the objective function. For a data set consisting of n points in a p-dimensional space, our algorithm will output a clustering in O(npd²) floating point operations where d is the number of clusters determined by the algorithm. Experiments are given to demonstrate the usefulness of the algorithm.

Presentation (PDF File)

Back to Mathematical Challenges in Scientific Data Mining