CEM: Transcriptome Assembly and Isoform Expression Level Estimation from Biased RNA-Seq Reads

1. Introduction

CEM is an algorithm to assemble transcripts and estimate their expression levels from RNA-Seq reads.

2. Download

The latest version of CEM can be downloaded here (Last update: 11/17/2012).

Note: if you have IsoLasso 2.6.0 or higher versions installed, you don't have to download CEM separately. Just use the EM options in IsoLasso. Please refer to IsoLasso webpage for more details.

3. Instruction

3.1 Overview and Requirement

CEM runs on Linux system. It is suggested (but not required) that Python 2.7 or higher version is installed. To handle BAM files (which is the default format for many read mapping tools), you need to install SAMTools.

3.2 Compilation and Installation

3.2.1 Download and unzip

After downloading, the source code can be extracted using the following command: "tar xvzf cem.0.9.1.tar.gz".

3.2.2 Compiling C++ codes

The src folder in the source code includes the source codes. Simply run Makefile to compile them:

make

3.2.3 Adding the location of bin directory to your system PATH

For your convenience, you may add the location of the bin directory to your PATH variable. For example, if your folder of CEM is in the path /home/me/cem, then add the following line in the .bashrc file of your home folder:

export PATH="/home/me/cem/bin:"$PATH

3.2.4 Running runcem.py on .sam or .bam file

After compiling, you can run CEM using runcem.py on your read alignment file (.sam or .bam file). If you are provided with the original RNA-Seq reads, you need to first map them to reference genome using read mapping tools, like Tophat. Note: make sure your read alignment tool supports the XS:A tag in the BAM file for splicing reads (like Tophat). See here for more details.

After that, run "runcem.py sam/bam" to run CEM. For example, if your file is test.sam, type

runcem.py{options} test.sam

This command consists two parts, first, runcem.py uses processsam program in the to pre-process sam/bam files to generate an instance file. Then, runcem.py calls another program, isolassocem to run this instance file and outputs assembled transcripts. You can also use

runcem.py test.instance

to run CEM directly on the test.instance file generated by processsam.

Run "runcem.py", "processsam" and "isolassocem" without providing any parameters to see their usages.

Note: If you want CEM to only calculate the expression levels of given transcripts (provided in BED format), use the following command:

runcem.py -x <BED> --forceref test.sam

4. Usage

4.1 runcem.py

Usage: runcem.py {options} < in.bam | in.sam | - >

This is main entry for CEM. It processes sam/bam file or .instance file, and outputs the assembled transcripts. This script will pass all options to processsam and isolassocem program.

4.2 processsam

Usage: processsam {options} <in.sam|->

processsam generates the instance file required for CEM.

Required input: A SAM format file containing the read mapping information, or command line ('-'). See NOTE for further information.

Options:

-n/--isoinfer	Generate IsoInfer input files (.readinfo, .bound and .generange).
-g/--min-gap-length <int>	The minimum length of the gap between two reads to be considered as separate genes. Default 0.
-c/--min-read-num <int>	The minimum number of clustered reads to output. Default 4.
-k/--max-pe-span <int>	The maximum pair-end spanning. Paired-end reads whose spanning exceeds this number will be discarded. Default 700000.
-x/--annotation <string>	Provide existing gene annotation file (in BED format). Adding this parameter will automatically incorporate existing gene annotation information into instance file. The bed file should be sorted according to the chromosome name and starting position of isoforms. This option is mutually exclusive to the -r/--range option.
-r/--range <string>	Use the provided gene ranges specified by the file (in BED format). This option is mutually exclusive to the -x/--annotation option.
-e/--segment-bound <string>	Provide the exon-intron boundary information specified by the filename. See NOTE for more information about the file format.
-s/--max-num-instance	The maximum number of instances be written to the file. Default -1 (no limit)
-u/--min-cvg-cut <0.0-1.0>	The fraction for coverage cutoff, should be between 0-1. A higher value will be more sensitive to coverage discrepancies in one gene. Default 0.05.
-b/--single-only	Treat reads as single-end reads, even if they are paired-end reads.
-j/--min-junc-count <int>	Minimum junction count. Only junctions with no less than this number of supporting reads are considered. Default 1.
-a/--annotation	Output annoation files, including read coverage (.real.wig), read coverage considering junctions and paired-end read spans (.wig), instance range and boundary (.bound.bed), junctions (.bed) and junction summary (.junction.bed).
-v/--no-coverage	Don't output coverage information to the instance file.
-o/--prefix <string>	Specify the prefix of all generated files. The default value is the provided file name.

NOTE:

processsam acceptes STDIN input of sam file by using '-' as filename. This is especially useful if you have the .bam file (e.g., from Tophat output), or you want to do some read filtering before running CEM. For example, if Samtools is installed, then use the following command to run processsam on only chromosome 1 reads:

samtools view accepted_hits.bam chr1 | processsam -a -o accepted_hits -

The sam/bam file must be sorted according to the chromosome name and starting position. The bam file format can be sorted using 'samtools sort' command, while for the sam file, you can use the sort command. In Unix or Mac systems, use the following command:

sort -k 3,3 -k 4,4n in.sam > in.sorted.sam

sort -k 3,3 -k 4,4n in.sam | processsam -a -o accepted_hits -

The exon-intron boundary file (specified by -e/--segment-bound option) records the exon-intron boundary used by CEM. Each line in the file represents one boundary information, and should include chromosome name, start position, end position (equal to start position) and direction (+/-). These fields should be tab-separated, and only the first 4 fields are used. For example,

chr1 15796 15796 +

4.3 isolassocem

Version: 2.6
Usage: isolassocem {options} <Instance file>

Input: the instance file generated by processsam.

Options:

Parameters:
-p/--pairend <int,int>	Specify the paired-end read span and standard derivation. Default 200,20. You may use this Python script to estimate both values from a given SAM/BAM file.
-c/--min-read-num <int>	The minimum number of clustered reads to output. Default 0.
IO Options:
--minexp <float>	The minimum expression level threshold cutoff. Default 0.1.
--verbose	Enable verbose output.
-o/--prefix <string>	Specify the prefix of the output files. The default value is the instance file.
--no-filter	Do not filter isoforms with 0 expression levels. If this option is on, the predicted expression levels of some isoforms will be 0.
--id <string>	Only predict the instance with specified ID.
Reference Options:
-d/--directref	Output gene annotation (the Refs field in the instance file) directly. All expression levels are assigned 1.
--forceref	Calculate the expression levels of gene annotations (the Refs field in the instance file). Using this option will automatically turn on the '--no-filter' option.
CEM Options
--useem	Use EM algorithm instead of LASSO algorithm (which is default) to estimate expression levels. This option is automatically turned on when using runcem.py, but you have to specify it if you use isolassocem by yourself.
--usebias	Use quasi-multinomial bias correction.
--elim	Allow CEM to eliminate low probability isoforms during the iteration.
--correctn	Correct the gene read counts according to the quasi-multinomial bias parameter. Warning: this is an experimental option so use it at your own risk. Due to the sample uncertainty, the calculation of the bias parameter may skew the distribution of some of the highly expressed genes.
--alpha <float>	Specify the parameter of the negative Dirichlet prior. Default 5.
--min-frac <float >	The minimum fraction of isoforms to be reported. Default 0.01. This option is invalid if --no-filter option is set.

5. Update history

2012.11.17 CEM v 0.9.1

Fix bugs in the script runcem.py.
Fix a bug in the program which eliminates the direction of most transcripts.

2012.07.22 CEM v 0.9

by Wei Li