Combinatorics and Statistics of Gene Clusters

Laxmi Parida
IBM Research

This talk is motivated by the need to estimate more accurately the
statistical significance of large gene clusters along a chromosome.
Consider a scenario in which a common gene cluster is extracted from two
closely related species, such as humans and rats. It is quite possible that
in both species this cluster contains further clusters, which in turn
contain clusters of their own, and so on. Let S denote the collection of
all subclusters of this common cluster that occur in both species. One
traditional way of computing the probability of occurrence of a cluster is
to ignore the subclusters S and simply use the probability of occurrence of
each gene in the cluster. The effectiveness of this model is unclear when
the number of genes is very large and the number of occurrences of each
gene is very small. In this talk, we address such a scenario with a method
that is cognizant of S, which arguably gives a better estimate of the
probability of the cluster's occurrence.
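
To make the collection S concrete, here is a minimal sketch, not the method
presented in the talk. It assumes each species' gene order is modeled as a
permutation of the same gene set, with every gene occurring exactly once,
and it treats a subcluster as any set of two or more genes that is
contiguous in both orders. The function name common_subclusters and the toy
gene labels are hypothetical.

```python
def common_subclusters(order_a, order_b):
    """Return every gene set of size >= 2 that is contiguous in both orders.

    Assumption: order_a and order_b are permutations of the same genes,
    each gene appearing exactly once.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    n = len(order_a)
    found = []
    for start in range(n):
        # Track the span that order_a[start:end + 1] occupies in order_b.
        lo = hi = pos_b[order_a[start]]
        for end in range(start + 1, n):
            p = pos_b[order_a[end]]
            lo, hi = min(lo, p), max(hi, p)
            # The genes are contiguous in order_b exactly when the span
            # they occupy there has the same length as the window here.
            if hi - lo == end - start:
                found.append(frozenset(order_a[start:end + 1]))
    return found


# Toy gene orders for two species (hypothetical labels).
species_1 = ["a", "b", "c", "d", "e"]
species_2 = ["c", "a", "b", "e", "d"]
S = common_subclusters(species_1, species_2)
```

With these toy orders, S holds the full cluster {a, b, c, d, e} together
with the nested subclusters {a, b, c}, {a, b} and {d, e}. This quadratic
scan is only practical for small examples.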

However, solving this problem requires estimating a function,
P-arrangement(k), which we introduce in order to understand the
combinatorics of the clusters. The first part of the talk will provide the
background and the use of permutations in a biological setting. Next, we
introduce a combinatorial structure known as the PQ-tree, which is used in
conjunction with P-arrangements to compute the probability of clusters.

Back to Workshop IV: Search and Knowledge Building for Biological Datasets