A General Framework for Biclustering Gene Expression Data

Common clustering methods are poorly suited to gene expression data collected under many heterogeneous conditions, because their assumption that genes co-express in all conditions is too restrictive. In "A General Framework for Biclustering Gene Expression Data" (submitted for publication), we propose a novel biclustering method that simultaneously identifies groups of genes and groups of conditions based on a universal merit, which in principle can detect any type of bicluster. Our experiments show that the approach is versatile and promising.
Here is a Java implementation of our new method, UBCLUST 1.20 beta. The input file should be a plain text file delimited by white space (or tabs), without row or column names. Run java -jar ubc.jar to print the usage, as follows:
    Usage: java -jar ubc.jar [options] datafile
    Options:
      -l <level>        discretization levels (default 128)
      -t <temperature>  initial temperature (default 0.00001)
      -f <factor>       temperature factor (default 0.9)
      -e <estimator>    Kolmogorov complexity estimator
                          0 : Uniform Model (default)
                          1 : Constant Rows Model
                          2 : Additive Model
                          3 : Relaxed OPSM
      -k <runs>         run how many times the MCMC algorithm (default 1)
      -r                trace the MCMC algorithm
      -h                print this help message
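The -t and -f options above describe an annealing schedule. The page does not state the exact update rule or stopping criterion used by UBCLUST, so the following is only a minimal sketch of a common geometric schedule, assuming the temperature is multiplied by the factor once per step and that annealing stops at some final temperature. The class name CoolingSchedule, the method stepsToReach, and the finalTemperature parameter are hypothetical and are not part of ubc.jar.

```java
public class CoolingSchedule {
    // Counts how many multiplications by `factor` are needed before the
    // temperature is no longer above `finalTemperature`.
    // ASSUMPTION: geometric cooling, T_next = T * factor. UBCLUST's actual
    // update rule and stopping criterion are not documented on this page.
    static int stepsToReach(double initialTemperature, double factor, double finalTemperature) {
        if (initialTemperature <= 0 || finalTemperature <= 0) {
            throw new IllegalArgumentException("temperatures must be positive");
        }
        if (factor <= 0 || factor >= 1) {
            throw new IllegalArgumentException("factor must be strictly between 0 and 1");
        }
        int steps = 0;
        double temperature = initialTemperature;
        while (temperature > finalTemperature) {
            temperature *= factor;
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        // Hypothetical final temperature, chosen only for illustration.
        double finalTemperature = 1e-9;
        System.out.println(stepsToReach(1e-5, 0.9, finalTemperature));
        System.out.println(stepsToReach(1e-3, 0.9, finalTemperature));
    }
}
```

Under these assumptions, a higher starting temperature or a factor closer to 1 both increase the number of cooling steps.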
Among the options, the initial temperature has an important influence on the annealing procedure. A larger initial temperature makes annealing take longer, but the algorithm is then more likely to return a globally optimal solution. The algorithm returns only one bicluster per run. To obtain multiple biclusters, run the program several times or use the -k option. To find different types of biclusters, select a different Kolmogorov complexity estimator with the -e option. The output files row.txt and col.txt record the rows and columns of the bicluster(s) found; the number 1 indicates that the corresponding row or column is in the bicluster.
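The 0/1 membership output can be turned into a list of row or column positions with a few lines of Java. This is a minimal sketch, assuming each output file holds the indicators for a single bicluster as white-space-separated tokens; the exact layout of row.txt and col.txt (especially with -k greater than 1) is not specified here, and the class BiclusterMask and method selectedIndices are hypothetical helpers, not part of ubc.jar.

```java
import java.util.ArrayList;
import java.util.List;

public class BiclusterMask {
    // Returns the zero-based positions of tokens equal to "1".
    // ASSUMPTION: maskText is a white-space-delimited sequence of indicators
    // for one bicluster; the real file layout may differ.
    static List<Integer> selectedIndices(String maskText) {
        List<Integer> indices = new ArrayList<>();
        String trimmed = maskText.trim();
        if (trimmed.isEmpty()) {
            return indices;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals("1")) {
                indices.add(i);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        // Example indicator string; in practice the text would come from row.txt or col.txt.
        System.out.println(selectedIndices("0 1 1 0 1")); // prints [1, 2, 4]
    }
}
```

The positions are zero-based; add 1 if the rows and columns of the data file are counted from 1. The file contents could be supplied with java.nio.file.Files.readString (Java 11 or later).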

Please send comments and questions to Haifeng Li
