Minimum Conditional Entropy Clustering

A preliminary and common methodology for analyzing gene expression data is the clustering technique. In Minimum Entropy Clustering and Applications to Gene Expression Analysis (in CSB04), we proposed the minimum entropy clustering (MEC) algorithm that efficiently minimizes the conditional entropy of clusters given samples. Many mathematical facts, such as Fano's inequality and Bayes probability of error, indicate that our method is a natural method to group patterns. The experimental results on synthetic data and real gene expression data show that the method performs better than the popular methods such as hierarchical clustering, k-means, self-organizing maps (SOMs), and expectation maximization (EM) algorithm, in terms of adjusted Rand index. In particular, the method performs very well even when the correct number of clusters is unknown. The method can also correctly reveal the structure of data and effectively identify outliers simultaneously. These improvements are very important in practice because it is difficult for biologists to know the exact number of clusters in many situations. Besides, outliers often contain important hidden information and provide clues for unknown knowledge, e.g., a gene with abnormal expression may be related to some disease.

Here is a Java implementation of our algorithm MEC 1.03. The input file should be a simple white-space (or tab) delimited text file without row and column names. The output is the group number of each row. Please use java -jar mec.jar to get the usage as follows:

Usage: java -jar mec.jar [options] datafile
  -a <alpha>  alpha in alpha-entropy (default 1 for Shannon's entropy)
  -c <number> number of clusters     (default 2)
  -k <kernel> Kernel of Parzen window
                0 : Hypercube (default)
                1 : Gaussian
  -w <width>  Bandwidth of Parzen window (default is 1.5)
              The bandwidth of i-th deimension is set to w*sigma,
              where sigma is the estimated standard deviation of
              i-th dimension.
  -i <I>      Run <I> times k-means for initialization (default 1)
  -v          verbose mode
  -h          print this help message
In the options, the width of Parzen window is important. For the datasets of high dimensionality, a large width (say, 2.5 or 3) is recommended. However, a too large width is also not appropriate, which makes the algorithm to run slow and produce poor results. The optimal width depends on the datasets. For the parameter c, the number of clusters, the users should set a larger number than that they expect. The extra clusters could be used to detect outliers. For other parameters, the defult should be fine for most datasets.

Recently, gene expression data with many heterogeneous conditions appeared. The common clustering methods (including MEC) cannot be applied on these datasets because the assumption that genes co-express in all conditions is too restricted. Please see UBCLUST for our universal biclustering algroithm that simultaneously identify groups of genes and groups of conditions.

The following Java Applet is a demonstration of MEC algorithm compared with k-means, deterministic annealing (DA) clustering, and self-organizing maps (SOMs), in which we use several 2-dimensional synthetic data. The k-means is employed to give the initial partition of our method.

Please send comments and questions to Haifeng Li

Total visits: Web Counter