ISP: Accurate Inference of Isoforms from Multiple Sample RNA-Seq Data.


Installation Instructions

System Requirements

ISP is written in C++ and can be run on Linux/Unix systems. ISP requires the following packages: GLPK and QuadProg.

Download

The latest code (version 0.3, Oct. 15, 2014) can be downloaded here.

Compilation and Installation

1. Unzip

After downloading, the source code can be extracted using the following command: "tar xvzf isp.0.3.tar.gz" (for isp version 0.3).

2. Compiling the C++ code

The src folder contains the C++ source files. Run make to compile them:

make

3. Adding the location of bin directory to your system PATH

For convenience, you may add the location of the bin directory to your PATH variable. For example, if your ISP folder is /home/me/isp, add the following line to the .bashrc file in your home folder:

export PATH="/home/me/isp/bin:"$PATH

4. (Optional) Specify the path of the GLPK and QuadProg header files and library files

Compilation requires the header files of both packages, and linking and execution rely on two library files: libglpk.so (GLPK library) and libQuadProgpp.so (QuadProg library). Generally you don't need to worry about the library files, but if you encounter a "library not found" or "header not found" error during compilation or execution, you need to modify two environment variables, CXXFLAGS and LD_LIBRARY_PATH.

CXXFLAGS is used for compiling and linking. After installing the two packages, locate the header files and the library files (these should include libglpk.so and libQuadProgpp.so), and specify their directories using the "-I" and "-L" options in CXXFLAGS, respectively. For example, if the header files are in /home/me/include and the library files are in /home/me/lib, run the following command in the shell (or put it in the .bashrc file in your home folder):

export CXXFLAGS="$CXXFLAGS -I/home/me/include -L/home/me/lib"

To run the program, modify LD_LIBRARY_PATH as follows:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/me/lib
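
Before changing LD_LIBRARY_PATH, it can help to see which of the two library files are already visible in the directories it lists. The following is a minimal sketch, not part of ISP; the function name missing_libs is hypothetical. It only inspects the directories in the given path string, whereas the dynamic linker also searches system default locations, so a library reported here as missing may still be found at run time.

```python
import os

# Library file names as given in the text above.
REQUIRED_LIBS = ("libglpk.so", "libQuadProgpp.so")


def missing_libs(ld_library_path, required=REQUIRED_LIBS):
    """Return the required library files not present in any listed directory.

    ld_library_path is a colon-separated directory list, in the same form
    as the LD_LIBRARY_PATH environment variable.
    """
    dirs = [d for d in ld_library_path.split(":") if d]
    missing = []
    for name in required:
        found = any(os.path.isfile(os.path.join(d, name)) for d in dirs)
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Report on the current environment (hypothetical helper, not an ISP tool).
    print(missing_libs(os.environ.get("LD_LIBRARY_PATH", "")))
```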


Demo and the explanation of results

1. Running the demo

A simple test case is provided in the test folder. Run runisp.sh to test the program.

2. The final results

If the demo runs successfully, an output folder (out) is created that stores all intermediate and final results. The final isoform predictions are located in the following files:

allpred.n.ins.pred Isoform prediction file. Each line specifies one isoform with the following fields:
  1. the isoform index in a gene;
  2. the isoform id;
  3. chromosome name;
  4. orientation;
  5. transcript start;
  6. transcript end;
  7. exon start coordinate;
  8. exon end coordinate;
  9. average expression level in all samples.
allpred.n.ins.pred.explv The expression levels of isoforms in different samples.
allpred.n.ins.pred.gtf The structures of isoforms in GTF format.

Here, n is the number of samples used (n=3 in our demo).
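
The nine fields of a .pred line listed above can be read into a small record. This is a minimal sketch under assumptions the text does not confirm: that fields are tab-separated and that exon start and end coordinates are comma-separated lists. Check a real allpred.n.ins.pred file and adjust the delimiters before relying on it. The names Isoform and parse_pred_line are hypothetical.

```python
from typing import List, NamedTuple


class Isoform(NamedTuple):
    index: int               # 1. isoform index in a gene
    isoform_id: str          # 2. isoform id
    chrom: str               # 3. chromosome name
    strand: str              # 4. orientation
    tx_start: int            # 5. transcript start
    tx_end: int              # 6. transcript end
    exon_starts: List[int]   # 7. exon start coordinates
    exon_ends: List[int]     # 8. exon end coordinates
    mean_expression: float   # 9. average expression level in all samples


def _coords(field):
    # ASSUMPTION: comma-separated integers, possibly with a trailing comma.
    return [int(x) for x in field.split(",") if x]


def parse_pred_line(line):
    # ASSUMPTION: the nine fields are separated by tabs.
    f = line.rstrip("\n").split("\t")
    if len(f) != 9:
        raise ValueError("expected 9 fields, got %d" % len(f))
    return Isoform(int(f[0]), f[1], f[2], f[3], int(f[4]), int(f[5]),
                   _coords(f[6]), _coords(f[7]), float(f[8]))
```

For example, a made-up line such as "1", "iso_1", "chr1", "+", "1000", "2000", "1000,1500,", "1200,2000,", "3.5" joined by tabs would yield an Isoform with two exons.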

3. Understanding the whole workflow of ISP

The flow algorithm is only one small part of the workflow. For assembly from multiple samples, information must first be collected from each sample and then integrated across samples. For a given set of samples, the workflow consists of 6 steps:

  1. First pass over all the BAM files. This step generates the gene boundaries, exon/intron boundaries and read coverage of the individual samples.
  2. Merge. This step merges the gene boundaries, exon/intron boundaries and read coverages from step 1. The output is a "universal" set of gene boundaries and exon/intron boundaries, used in step 3.
  3. Second pass over all the BAM files. This step uses the output of step 2 to record information for the individual samples, given the merged definition of gene boundaries and exon/intron boundaries.
  4. Merge the results of the second pass. This step merges all files from step 3 into a single file, which is used in step 5 for prediction.
  5. Prediction. This is the core of the algorithm. Given the file from step 4, it uses the flow algorithm to predict transcripts.
  6. Post-prediction. This step processes the results of step 5.
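
The six steps above are the same step numbers that the --debug option of runminst accepts (see Parameters). The following minimal sketch shows how a comma-separated step list in that style could be validated. It is not ISP's own parsing code: the function name parse_steps and the step descriptions are illustrative, and running the selected steps in ascending order is an assumption.

```python
# Step numbers and short descriptions, paraphrased from the list above.
STEPS = {
    1: "first pass over the BAM files",
    2: "merge first-pass results",
    3: "second pass over the BAM files",
    4: "merge second-pass results",
    5: "prediction",
    6: "post-prediction",
}


def parse_steps(value="1,2,3,4,5,6"):
    """Turn a string such as '5,6' into a sorted list of valid step numbers."""
    steps = set()
    for token in value.split(","):
        number = int(token)  # raises ValueError on non-integer input
        if number not in STEPS:
            raise ValueError("step must be between 1 and 6: %r" % token)
        steps.add(number)
    # ASSUMPTION: selected steps are executed in ascending order.
    return sorted(steps)


if __name__ == "__main__":
    for step in parse_steps("5,6"):
        print(step, STEPS[step])
```

With such a selection, rerunning only the prediction and post-prediction steps reuses the intermediate files already produced by steps 1 to 4.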

4. Intermediate files

For your convenience, the explanations of intermediate files are provided below.
allpred.n.ins The multiple instance file (generated by step 4), recording all the information necessary to perform the ISP algorithm.
allbound.n.bed The merged gene boundary (by step 2).
alljunc.n.txt The merged junction read statistics (by step 2).
merged.n.bed The merged gene boundary (by step 2).
pthreadpred_i_n.pred/pthreadpred_i_n.pred.explv The predicted isoforms and isoform expression levels produced by different threads (by step 5).
sample_i.1st.bed/.bound.bed/.instance/.junc.bed/.real.wig/.wig The first scan results of individual samples (step 1), including junction reads, gene boundary, IsoLasso instance, junction read summary, the read coverage (without introns), and the read coverage (with introns).
sample_i.2nd.n.instance The second scan results of individual samples (step 3), stored as an IsoLasso-compatible instance file.

Here, n is the number of samples used (n=3 in our demo), and i is the sample index (i=0,1,2 in our demo).
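
To make the naming scheme concrete, the sketch below lists the intermediate file names expected for a given number of samples, grouped by the workflow step that produces them. It assumes that n and i are substituted literally into the patterns above and that the default "sample_" label prefix is in use. The pthreadpred files are left out because their indices depend on the thread configuration. The function name intermediate_files is hypothetical.

```python
def intermediate_files(n, label_prefix="sample_"):
    """Return {step: [file names]} for n samples, following the patterns above."""
    first_scan_suffixes = [".1st.bed", ".bound.bed", ".instance",
                           ".junc.bed", ".real.wig", ".wig"]
    files = {1: [], 2: [], 3: [], 4: []}
    for i in range(n):  # sample index i = 0 .. n-1, as in the demo
        sample = "%s%d" % (label_prefix, i)
        files[1].extend(sample + suffix for suffix in first_scan_suffixes)
        files[3].append("%s.2nd.%d.instance" % (sample, n))
    files[2] = ["allbound.%d.bed" % n, "alljunc.%d.txt" % n, "merged.%d.bed" % n]
    files[4] = ["allpred.%d.ins" % n]
    return files


if __name__ == "__main__":
    for step, names in sorted(intermediate_files(3).items()):
        print("step", step, names)
```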


Parameters

runminst

runminst is the main entry point of the ISP program.

Usage:

runminst {OPTIONS} <BAM file 1|COMMAND 1> <BAM file 2|COMMAND 2> ...

Options:

-h/--help Print help message.
-r/--range [range] Specify the range of the BAM file to process. Ranges are specified as 'chrname:start-end'; multiple ranges must be separated by commas (,).
-t/--tmp [tmp] Specify the temporary dir. Default tmp/
-p/--pthread [int] Specify the number of threads used. Default 1.
--debug [int,...] Specify the steps executed. The integer value must be between 1-6. Default 1,2,3,4,5,6.
-c/--command Instead of providing BAM file names, use STDOUT of commands executable under the current path.
-s/--sam The file format is SAM instead of BAM. If this option is on, the -r/--range option is not allowed.
-L/--label <string,...> Specify the labels of the input files, separated by commas. The number of labels must equal the number of input files. Default "sample_n".
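
The range syntax accepted by -r/--range ('chrname:start-end', comma-separated) can be checked before launching a long run. The following is a minimal sketch of a validator for that syntax. It is not taken from the ISP source; the function name parse_ranges is hypothetical, and requiring start to be no greater than end is an assumption.

```python
def parse_ranges(spec):
    """Parse 'chr1:100-200,chr2:5-50' into [(chrom, start, end), ...]."""
    ranges = []
    for item in spec.split(","):
        # Split on the last ':' so the chromosome name is kept whole.
        chrom, colon, span = item.strip().rpartition(":")
        if not colon or not chrom:
            raise ValueError("expected 'chrname:start-end', got %r" % item)
        start_text, dash, end_text = span.partition("-")
        if not dash:
            raise ValueError("expected 'start-end', got %r" % span)
        start, end = int(start_text), int(end_text)
        # ASSUMPTION: a range must not run backwards.
        if start > end:
            raise ValueError("start is greater than end in %r" % item)
        ranges.append((chrom, start, end))
    return ranges


if __name__ == "__main__":
    print(parse_ranges("chr1:100-200,chr2:5-50"))
```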

predminst

predminst is the core algorithm of the ISP program (step 5).

Usage: predminst {OPTIONS} <instance file>

-h Print help information
-i [ID] Predict only the instance with specified ID.
-o [file] The prefix of the output file.
-p x,y Only predict instances with ID pattern x,y. For example, if (x,y)=(3,4), the program will only predict instances with ID 3,7,11,15,...
--min-frac [0.0-1.0] Only predict isoforms with expression greater than this fraction of the most abundant isoform. Default 0.1.
--rd-alpha [0.0-1.0] The weights for read supporting variables. Default 0.1.
--assemble-by-sample Assemble the transcripts sample-by-sample. This will lead to higher sensitivity but lower precision.
--no-correlation Do not perform segment correlation.
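
Two of the options above are easiest to understand through a small example. The sketch below shows one reading of the -p x,y pattern (for (x,y)=(3,4) it selects IDs 3, 7, 11, 15, ...) and of the --min-frac filter. Both functions are illustrative and are not the predminst implementation. The names matches_pattern and filter_by_min_frac are hypothetical, and treating the filter as applying to the isoforms of a single gene is an assumption.

```python
def matches_pattern(instance_id, x, y):
    """True if instance_id is x, x+y, x+2y, ... (one reading of '-p x,y')."""
    return instance_id >= x and (instance_id - x) % y == 0


def filter_by_min_frac(expression, min_frac=0.1):
    """Keep isoforms whose expression exceeds min_frac times the maximum.

    expression maps isoform id -> expression level.
    ASSUMPTION: the mapping holds the isoforms of one gene.
    """
    if not expression:
        return {}
    top = max(expression.values())
    return {iso: level for iso, level in expression.items()
            if level > min_frac * top}


if __name__ == "__main__":
    print([i for i in range(1, 17) if matches_pattern(i, 3, 4)])
    print(filter_by_min_frac({"a": 10.0, "b": 0.5, "c": 2.5}))
```

With this kind of ID pattern, separate predminst runs could each process a disjoint subset of the instances.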

History

2014.10.15, version 0.3


Wei Li, DFCI/HSPH and UCR, 2014.