CEM is an algorithm to assemble transcripts and estimate their expression levels from RNA-Seq reads.
The latest version of CEM can be downloaded here (Last update: 11/17/2012).
Note: if you have IsoLasso 2.6.0 or higher versions installed, you don't have to download CEM separately. Just use the EM options in IsoLasso. Please refer to IsoLasso webpage for more details.
CEM runs on Linux system. It is suggested (but not required) that Python 2.7 or higher version is installed. To handle BAM files (which is the default format for many read mapping tools), you need to install SAMTools.
After downloading, the source code can be extracted using the following command: "tar xvzf cem.0.9.1.tar.gz".
The src folder in the source code includes the source codes. Simply run Makefile to compile them:
make
For your convenience, you may add the location of the bin directory to your PATH variable. For example, if your folder of CEM is in the path /home/me/cem, then add the following line in the .bashrc file of your home folder:
export PATH="/home/me/cem/bin:"$PATH
After compiling, you can run CEM using runcem.py on your read alignment file (.sam or .bam file). If you are provided with the original RNA-Seq reads, you need to first map them to reference genome using read mapping tools, like Tophat. Note: make sure your read alignment tool supports the XS:A tag in the BAM file for splicing reads (like Tophat). See here for more details.
After that, run "runcem.py sam/bam" to run CEM. For example, if your file is test.sam, type
runcem.py{options} test.sam
This command consists two parts, first, runcem.py uses processsam program in the to pre-process sam/bam files to generate an instance file. Then, runcem.py calls another program, isolassocem to run this instance file and outputs assembled transcripts. You can also use
runcem.py test.instance
to run CEM directly on the test.instance file generated by processsam.
Run "runcem.py", "processsam" and "isolassocem" without providing any parameters to see their usages.
Note: If you want CEM to only calculate the expression levels of given transcripts (provided in BED format), use the following command:
runcem.py -x <BED> --forceref test.sam
Usage: runcem.py {options} < in.bam | in.sam | - >
This is main entry for CEM. It processes sam/bam file or .instance file, and outputs the assembled transcripts. This script will pass all options to processsam and isolassocem program.
Usage: processsam {options} <in.sam|->
processsam generates the instance file required for CEM.
Required input: A SAM format file containing the read mapping information, or command line ('-'). See NOTE for further information.
Options:
-n/--isoinfer | Generate IsoInfer input files (.readinfo, .bound and .generange). |
-g/--min-gap-length <int> | The minimum length of the gap between two reads to be considered as separate genes. Default 0. |
-c/--min-read-num <int> | The minimum number of clustered reads to output. Default 4. |
-k/--max-pe-span <int> | The maximum pair-end spanning. Paired-end reads whose spanning exceeds this number will be discarded. Default 700000. |
-x/--annotation <string> | Provide existing gene annotation file (in BED format). Adding this parameter will automatically incorporate existing gene annotation information into instance file. The bed file should be sorted according to the chromosome name and starting position of isoforms. This option is mutually exclusive to the -r/--range option. |
-r/--range <string> | Use the provided gene ranges specified by the file (in BED format). This option is mutually exclusive to the -x/--annotation option. |
-e/--segment-bound <string> | Provide the exon-intron boundary information specified by the filename. See NOTE for more information about the file format. |
-s/--max-num-instance | The maximum number of instances be written to the file. Default -1 (no limit) |
-u/--min-cvg-cut <0.0-1.0> | The fraction for coverage cutoff, should be between 0-1. A higher value will be more sensitive to coverage discrepancies in one gene. Default 0.05. |
-b/--single-only | Treat reads as single-end reads, even if they are paired-end reads. |
-j/--min-junc-count <int> | Minimum junction count. Only junctions with no less than this number of supporting reads are considered. Default 1. |
-a/--annotation | Output annoation files, including read coverage (.real.wig), read coverage considering junctions and paired-end read spans (.wig), instance range and boundary (.bound.bed), junctions (.bed) and junction summary (.junction.bed). |
-v/--no-coverage | Don't output coverage information to the instance file. |
-o/--prefix <string> | Specify the prefix of all generated files. The default value is the provided file name. |
NOTE:
samtools view accepted_hits.bam chr1 | processsam -a -o accepted_hits -
sort -k 3,3 -k 4,4n in.sam > in.sorted.sam
to sort in.sam into in.sorted.sam, or use the pipe:sort -k 3,3 -k 4,4n in.sam | processsam -a -o accepted_hits -
chr1 15796 15796 +
Parameters: | |
-p/--pairend <int,int> | Specify the paired-end read span and standard derivation. Default 200,20. You may use this Python script to estimate both values from a given SAM/BAM file. |
-c/--min-read-num <int> | The minimum number of clustered reads to output. Default 0. |
IO Options: | |
--minexp <float> | The minimum expression level threshold cutoff. Default 0.1. |
--verbose | Enable verbose output. |
-o/--prefix <string> | Specify the prefix of the output files. The default value is the instance file. |
--no-filter | Do not filter isoforms with 0 expression levels. If this option is on, the predicted expression levels of some isoforms will be 0. |
--id <string> | Only predict the instance with specified ID. |
Reference Options: | |
-d/--directref | Output gene annotation (the Refs field in the instance file) directly. All expression levels are assigned 1. |
--forceref | Calculate the expression levels of gene annotations (the Refs field in the instance file). Using this option will automatically turn on the '--no-filter' option. |
CEM Options | |
--useem | Use EM algorithm instead of LASSO algorithm (which is default) to estimate expression levels. This option is automatically turned on when using runcem.py, but you have to specify it if you use isolassocem by yourself. |
--usebias | Use quasi-multinomial bias correction. |
--elim | Allow CEM to eliminate low probability isoforms during the iteration. |
--correctn | Correct the gene read counts according to the quasi-multinomial bias parameter.
Warning: this is an experimental option so use it at your own risk. Due to the sample uncertainty, the calculation of the bias parameter may skew the distribution of some of the highly expressed genes. |
--alpha <float> | Specify the parameter of the negative Dirichlet prior. Default 5. |
--min-frac <float > | The minimum fraction of isoforms to be reported. Default 0.01. This option is invalid if --no-filter option is set. |
2012.11.17 CEM v 0.9.1
2012.07.22 CEM v 0.9
by Wei Li