IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly

1. Introduction

IsoLasso is an algorithm to assemble transcripts and estimate their expression levels from RNA-Seq reads.

2. Download

The latest version, IsoLasso v 2.6.1, can be downloaded here. (Last update: 11/17/2012). The full paper can be downloaded here.

Due to the wide popularity of the IsoLasso C++ version, the MATLAB version of IsoLasso has been discontinued since v 2.5. However, you can still find the MATLAB scripts of IsoLasso in previous versions.

NOTE: If you have compilation problems related to the CGAL or GSL libraries, see Section 3.2.1 for alternative solutions.

3. Instruction

3.1 Overview and Requirement

IsoLasso currently runs on Linux. Earlier versions required Matlab with the Optimization Toolbox installed on your Linux system; the Matlab environment is no longer required. To run IsoLasso more conveniently, it is suggested (but not required) that Python 2.7 or higher be installed.

The source code mainly consists of two parts: Matlab code (in the matlab folder) and C++ code (in the src folder). The main algorithm was originally implemented in Matlab and has now been ported to C++ thanks to Yingsheng (Daniel) Gao. Other preprocessing tools are written in C++. A Python script, runlasso.py, is used as the main entry point of the program.

To handle BAM files (the default output format of many read mapping tools), you need to install SAMTools.

3.2 Compilation and Installation

3.2.1 Prerequisites

If you have the Matlab environment, no third-party libraries are required beyond the standard C++ library. However, if you want IsoLasso to run without Matlab, the C++ code relies on the GSL (http://www.gnu.org/s/gsl/) and CGAL (http://www.cgal.org/) libraries. A GCC version newer than 4.3 is needed to compile the code.

For most Debian/Ubuntu systems, both libraries are provided as standard packages (libgsl-dev, libcgal-dev) and are easy to install.

If you don't have root privileges to install these packages, or you encounter "file/library not found" errors, you may need to download the source code of the libraries and compile and install them yourself.

Note: We provide an alternative program, CEM, if you really don't want to use the GSL or CGAL libraries. CEM has much in common with IsoLasso but uses the EM algorithm instead of quadratic programming to estimate isoform expression levels. CEM does not use any library functions from GSL or CGAL.

3.2.2 Specify the path of the GSL and CGAL header files and library files

Compiling requires the header files of both packages, and linking and running rely on two library files: libgsl.so (GSL library) and libCGAL.so (CGAL library). You need to modify two environment variables, CXXFLAGS and LD_LIBRARY_PATH.

CXXFLAGS is used for compiling and linking. After installation, locate the header files and the library files (which should include libCGAL.so), and specify their directories using the "-I" and "-L" options in CXXFLAGS, respectively. For example, if the header files are in /home/me/include and the library files are in /home/me/lib, run the following command in the shell (or put it in the .bashrc file in your home folder):

export CXXFLAGS="$CXXFLAGS -I/home/me/include -L/home/me/lib"

For running the program, modify the LD_LIBRARY_PATH as follows:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/me/lib
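
As a quick sanity check after setting LD_LIBRARY_PATH, the following minimal Python sketch looks for the two library files in the listed directories. The function name find_shared_library is hypothetical and not part of IsoLasso. The sketch only inspects LD_LIBRARY_PATH, so libraries installed in system directories (for example by a package manager) will be reported as not found even though the linker can still locate them.

```python
import os


def find_shared_library(name, search_path):
    """Return the first matching file in a colon-separated directory list, or None.

    `name` is a base library name such as "libgsl.so"; versioned files such as
    "libgsl.so.0" are accepted as matches too.
    """
    for directory in search_path.split(":"):
        if not directory or not os.path.isdir(directory):
            continue  # skip empty entries and directories that do not exist
        for entry in sorted(os.listdir(directory)):
            if entry == name or entry.startswith(name + "."):
                return os.path.join(directory, entry)
    return None


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    for lib in ("libgsl.so", "libCGAL.so"):
        found = find_shared_library(lib, path)
        print(lib, "->", found or "not found on LD_LIBRARY_PATH")
```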

3.2.3 For 64-bit Linux users

The IsoLasso source code includes a copy of the CGAL library (in isolassocpp/CGAL/lib). You don't need to download or compile the CGAL code; just copy the files to another location and set up the CXXFLAGS and LD_LIBRARY_PATH variables as above.

3.2.4 Compiling C++ codes

The src folder in the source code includes some programs written in C++. Simply run make to compile them:

make

Changing your runlasso.py script (no longer required in version 2.1)

If Python is installed on your system (see Section 3.1), you can use runlasso.py (in the scripts/bin folder) to conveniently run the IsoLasso program. You need to modify runlasso.py to change some path definitions, including the path of your MATLAB program and the path of your source folder. See runlasso.py for more details.

3.2.5 Adding the location of bin directory to your system PATH

For your convenience, you may add the location of the bin directory to your PATH variable. For example, if your folder of IsoLasso is in the path /home/me/isolasso, then add the following line in the .bashrc file of your home folder:

export PATH="/home/me/isolasso/bin:$PATH"

3.2.6 Running runlasso.py on .sam or .bam file

After compiling, you can run IsoLasso using runlasso.py on your read alignment file (.sam or .bam file). If you only have the original RNA-Seq reads, you first need to map them to a reference genome using a read mapping tool such as TopHat or SpliceMap.

After that, run "runlasso.py sam/bam" to run IsoLasso. For example, if your file is test.sam, type

runlasso.py {options} test.sam

to run IsoLasso. This command consists of two steps. First, runlasso.py uses the processsam program to pre-process the sam/bam file and generate an instance file. Then, runlasso.py calls another program, isolasso, to process this instance file and output the assembled transcripts. You can also use

runlasso.py test.instance

to run IsoLasso directly on the test.instance file generated by processsam.
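
The two steps described above can be sketched in Python as follows. This is only an illustration and not the actual runlasso.py: the helper names build_pipeline and run_pipeline are hypothetical, the instance file name <prefix>.instance is an assumption based on the test.instance example, and the only option used is -o/--prefix, which both processsam and isolasso document in Section 4.

```python
import subprocess


def build_pipeline(sam_path, prefix):
    """Return the command lines of the two stages as lists of arguments."""
    # ASSUMPTION: processsam writes its instance file to "<prefix>.instance".
    instance_path = prefix + ".instance"
    return [
        ["processsam", "-o", prefix, sam_path],     # stage 1: SAM -> instance file
        ["isolasso", "-o", prefix, instance_path],  # stage 2: instance -> transcripts
    ]


def run_pipeline(sam_path, prefix):
    """Run both stages in order, stopping at the first failure."""
    for command in build_pipeline(sam_path, prefix):
        subprocess.run(command, check=True)
```

Calling run_pipeline("test.sam", "test") would then run both programs in sequence, provided they are on your PATH.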

Run "runlasso.py", "processsam" and "isolasso" without any parameters to see their usage.

Note: If you want IsoLasso to only calculate the expression levels of given transcripts (provided in BED format), use the following command:

runlasso.py -x <BED> --forceref test.sam

3.2.7 The format of the instance file

Click here to see a detailed description of the instance file generated by processsam program.

4. Usage

4.1 runlasso.py

Usage: runlasso.py {options} < in.bam | in.sam | - >

This is the main entry point for IsoLasso. It processes a sam/bam file or an .instance file and outputs the assembled transcripts. This script passes all options on to the processsam and isolasso programs.

4.2 processsam

Usage: processsam {options} <in.sam|->

processsam generates the instance file required for IsoLasso.

Required input: A SAM format file containing the read mapping information, or standard input ('-'). See NOTE for further information.

Options:

-n/--isoinfer	Generate IsoInfer input files (.readinfo, .bound and .generange).
-g/--min-gap-length <int>	The minimum length of the gap between two reads to be considered as separate genes. Default 0.
-c/--min-read-num <int>	The minimum number of clustered reads to output. Default 4.
-k/--max-pe-span <int>	The maximum paired-end span. Paired-end reads whose span exceeds this number will be discarded. Default 700000.
-x/--annotation <string>	Provide an existing gene annotation file (in BED format). Adding this parameter will automatically incorporate the existing gene annotation information into the instance file. The BED file should be sorted by chromosome name and starting position of isoforms. This option is mutually exclusive with the -r/--range option.
-r/--range <string>	Use the gene ranges specified by the given file (in BED format). This option is mutually exclusive with the -x/--annotation option.
-e/--segment-bound <string>	Provide the exon-intron boundary information specified by the filename. See NOTE for more information about the file format.
-s/--max-num-instance	The maximum number of instances to be written to the file. Default -1 (no limit).
-u/--min-cvg-cut <0.0-1.0>	The fraction for coverage cutoff, should be between 0-1. A higher value will be more sensitive to coverage discrepancies in one gene. Default 0.05.
-b/--single-only	Treat reads as single-end reads, even if they are paired-end reads.
-j/--min-junc-count <int>	Minimum junction count. Only junctions with at least this number of supporting reads are considered. Default 1.
-a/--annotation	Output annotation files, including read coverage (.real.wig), read coverage considering junctions and paired-end read spans (.wig), instance range and boundary (.bound.bed), junctions (.bed) and junction summary (.junction.bed).
-v/--no-coverage	Don't output coverage information to the instance file.
-o/--prefix <string>	Specify the prefix of all generated files. The default value is the provided file name.
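
The -x/--annotation option above expects a BED file sorted by chromosome name and start position. A minimal Python sketch of such a sort is shown below. The function name sort_bed_lines is hypothetical, and the plain string ordering of chromosome names (so that chr10 sorts before chr2) is an assumption chosen to mirror a byte-wise "sort -k 1,1 -k 2,2n"; whether processsam needs exactly this ordering is not stated here.

```python
def sort_bed_lines(lines):
    """Return BED lines sorted by chromosome name, then numeric start position.

    Comment, "track" and "browser" lines are kept at the top in their original
    order; blank lines are dropped.
    """
    header, records = [], []
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped.strip():
            continue
        if stripped.startswith(("#", "track", "browser")):
            header.append(stripped)
        else:
            records.append(stripped)

    def sort_key(record):
        fields = record.split()
        return (fields[0], int(fields[1]))  # (chromosome, start)

    return header + sorted(records, key=sort_key)
```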

NOTE:

processsam accepts a sam file on STDIN when '-' is given as the filename. This is especially useful if you have a .bam file (e.g., from TopHat output), or if you want to do some read filtering before running IsoLasso. For example, if Samtools is installed, use the following command to run processsam on chromosome 1 reads only:

samtools view accepted_hits.bam chr1 | processsam -a -o accepted_hits -

The sam/bam file must be sorted by chromosome name and starting position. A bam file can be sorted using the 'samtools sort' command, while for a sam file you can use the sort command. On Unix or Mac systems, use the first command below to write a sorted file, or the second to pipe the sorted reads directly into processsam:

sort -k 3,3 -k 4,4n in.sam > in.sorted.sam

sort -k 3,3 -k 4,4n in.sam | processsam -a -o accepted_hits -
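
Before running processsam, it can help to confirm that a SAM file really is sorted. The following Python sketch is a rough check under stated assumptions: the function name first_unsorted_line is hypothetical, header lines starting with '@' are skipped, and the check only verifies that the reads of each chromosome are contiguous and that positions never decrease within a chromosome. It does not verify the order of the chromosome names themselves, which depends on the locale used by sort.

```python
def first_unsorted_line(sam_lines):
    """Return the 1-based number of the first out-of-order alignment line, or None."""
    seen_chromosomes = set()
    current_chrom = None
    last_pos = 0
    for number, line in enumerate(sam_lines, start=1):
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split("\t")
        chrom, pos = fields[2], int(fields[3])  # RNAME and POS columns
        if chrom != current_chrom:
            if chrom in seen_chromosomes:
                return number  # this chromosome already appeared earlier
            seen_chromosomes.add(chrom)
            current_chrom, last_pos = chrom, pos
        elif pos < last_pos:
            return number  # position went backwards within a chromosome
        else:
            last_pos = pos
    return None
```

For example, passing an open file handle for in.sorted.sam should return None if the file is ordered as described above.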

The exon-intron boundary file (specified by the -e/--segment-bound option) records the exon-intron boundaries used by IsoLasso. Each line in the file describes one boundary and should include the chromosome name, start position, end position (equal to the start position) and direction (+/-). These fields should be tab-separated, and only the first 4 fields are used. For example,

chr1 15796 15796 +
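
The format above can be checked with a short Python sketch before passing the file to processsam. The names Boundary and parse_boundary_line are hypothetical and used for illustration only; the rules enforced (at least four tab-separated fields, end equal to start, direction '+' or '-') are taken from the description above.

```python
from typing import NamedTuple


class Boundary(NamedTuple):
    chrom: str
    position: int
    strand: str


def parse_boundary_line(line):
    """Parse one boundary line; fields after the fourth are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least 4 tab-separated fields: %r" % line)
    chrom, start, end, strand = fields[:4]
    if int(start) != int(end):
        raise ValueError("end position must equal start position: %r" % line)
    if strand not in ("+", "-"):
        raise ValueError("direction must be '+' or '-': %r" % line)
    return Boundary(chrom, int(start), strand)
```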

4.3 isolasso

Version: 2.6
Usage: isolasso {options} <Instance file>

Input: the instance file generated by processsam.

Options:

Parameters:
-p/--pairend <int,int>	Specify the paired-end read span and standard deviation. Default 200,20. You may use this Python script to estimate both values from a given SAM/BAM file.
-c/--min-read-num <int>	The minimum number of clustered reads to output. Default 0.
IO Options:
--minexp <float>	The minimum expression level threshold cutoff. Default 0.1.
--verbose	Enable verbose output.
-o/--prefix <string>	Specify the prefix of the output files. The default value is the instance file.
--no-filter	Do not filter isoforms with 0 expression levels. If this option is on, the predicted expression levels of some isoforms will be 0.
--id <string>	Only predict the instance with specified ID.
Reference Options:
-d/--directref	Output gene annotation (the Refs field in the instance file) directly. All expression levels are assigned 1.
--forceref	Calculate the expression levels of gene annotations (the Refs field in the instance file). Using this option will automatically turn on the '--no-filter' option.
CEM Options:
--useem	Use the EM algorithm instead of the LASSO algorithm (the default) to estimate expression levels.
--usebias	Use quasi-multinomial bias correction.
--elim	Allow CEM to eliminate low probability isoforms during the iteration.
--correctn	Correct the gene read counts according to the quasi-multinomial bias parameter. Warning: this is an experimental option so use it at your own risk. Due to the sample uncertainty, the calculation of the bias parameter may skew the distribution of some of the highly expressed genes.
--alpha <float>	Specify the parameter of the negative Dirichlet prior. Default 5.
--min-frac <float>	The minimum fraction of isoforms to be reported. Default 0.01. This option is invalid if --no-filter option is set.
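
To illustrate how the two values for the -p/--pairend option above might be obtained, here is a minimal Python sketch that estimates them from the TLEN column of a SAM file. It is not the Python script referred to in the option description. The function names, the use of TLEN as the "paired-end read span", and the max_span cutoff of 1000 (meant to exclude pairs that span introns) are all assumptions made for illustration.

```python
import statistics


def estimate_pair_span(sam_lines, max_span=1000):
    """Return (mean, standard deviation) of the template length of proper pairs."""
    spans = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        flag, tlen = int(fields[1]), int(fields[8])
        # Keep properly paired reads (flag bit 0x2); a positive TLEN counts
        # each pair once, since the mate carries the negated value.
        if flag & 0x2 and 0 < tlen <= max_span:
            spans.append(tlen)
    if len(spans) < 2:
        raise ValueError("need at least two proper pairs to estimate the span")
    return statistics.mean(spans), statistics.stdev(spans)


def format_pairend_option(mean, stdev):
    """Format the estimate as the <int,int> value expected by -p/--pairend."""
    return "%d,%d" % (round(mean), round(stdev))
```

The formatted string can then be passed after -p when running isolasso. Estimates from spliced alignments should be treated with care, because TLEN includes any introns between the two mates.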

5. Update history

2012.11.17 IsoLasso v 2.6.1

Updates:

Fix bugs in the script runlasso.py.
Fix a bug in the program which eliminates the direction of most transcripts.

2012.07.22 IsoLasso v 2.6.0

Updates:

Rewrite the data structure of the instance file, reducing the instance file size. (Note that the new instance file structure is incompatible with the old one and vice versa.)
Fix a bug where some instances could lead to ill-posed quadratic programming problems when calculating the expression levels. In this case, the program now automatically switches to the EM method to calculate the expression levels.

2012.04.04 IsoLasso v 2.5.2

Updates:

Rewrite the data structure of paired-end read information in instance file, reducing the instance file size.
Fix a few bugs.
For IsoLasso C++ program, an experimental EM method is implemented to assemble isoforms and estimate expression levels. See the help information in isolasso program.

2012.02.22 IsoLasso v 2.5

Updates:

IsoLasso now accepts gene annotation file to estimate expression levels. The annotation file should be in BED format and can be used by '-x/--annotation' option in processsam, and '--forceref' option in isolasso to calculate the expression levels of given isoforms.
Rewrite runlasso.py and IsoLasso C++ code to fix some bugs.
Modify some file structures.

2012.02.08 IsoLasso v 2.4.1

Updates:

Remove some unnecessary source files.

2011.12.20 IsoLasso v 2.4

Updates:

Rewrite parts of the clustering program to improve memory usage performance.

2011.12.03 IsoLasso v 2.3

Updates:

Thanks to Yingsheng (Daniel) Gao, the Matlab code is translated into C++ code now.
Rewrite some parts of the processsam program to improve performance in read clustering.
Disable refonly and -x options.

2011.10.21 IsoLasso v 2.2

Updates:

Use the symbol '-' instead of the string 'STDIN' to indicate standard input. This style is compatible with many other programs.
Fix a bug for displaying error messages when paired-end reads are present.

2011.09.26 IsoLasso v 2.1

Updates:

Users don't need to modify runlasso.py before running IsoLasso.
Fix a bug in which the expression levels of some isoforms were negative values. This fix may cause the program to run slower than previous versions.
For SAM/BAM format, support CIGAR characters of "I" (insertion) and "D" (deletion).

2011.7.9 IsoLasso v 2.0

Some important updates:

IsoLasso now supports the alignment format of both SAM and BAM. For the BAM file, if samtools is installed, IsoLasso will use it to convert to SAM format first.
Update the matlab code to avoid using Statistics Toolbox (some users may not have this toolbox).
IsoLasso can run on provided transcript structures. Use "processsam -x [bed file] -r" options to generate instances containing provided transcripts only, and use 'refonly' option in running IsoLasso Matlab program.
Some bugs fixed.

2011.1.13 IsoLasso v 1.0

by Wei Li