Tools and scripts

Here are some simple tools and scripts I wrote; many of them are single scripts for next-generation sequencing.

GTF2BED: converting Cufflinks GTF files to BED predictions

GETINSERTSIZE: Extract insert size information of paired-end reads from Cufflinks output

Estimating paired-end read insert length from SAM/BAM files

GTF2BED: converting Cufflinks GTF files to BED predictions

(This tool is originally posted at blogspot: link)

Cufflinks writes its predictions as .GTF files. However, this file type is sometimes too large to process. For example, it may be too large to upload to UCSC gene browser (even you compress it). IGV browser also takes more memory resources to process .GTF file than processing BED file. So I wrote a small script (in Python 3) to convert GTF formatted files to BED files. This helps to reduce the file size dramatically. For example, this script converts a 120M GTF file to only 9M BED file, reducing the size by more than 90%!

Of course there are general tools to convert .GTF to .BED. For example, here is one Perl script to do this. However, my script is written specifically for Cufflinks .GTF files: it recognizes "gene_id", "transcript_id" and "FPKM" key word in attribute lists. "transcript_id" is converted as the name field in .BED file, and "FPKM" value is rounded as "score" field in .BED file.

Here is one example of .GTF file predicted by Cufflinks:

chr1 Cufflinks transcript 934797 935655 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021"; full_read_support "yes";
chr1 Cufflinks exon 934797 934812 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "1"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";
chr1 Cufflinks exon 934906 935655 324 - . gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "2"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";

The converted .BED file includes only one line as follows:

chr1 934796 935655 CUFF.29.3 24 - 934796 935655 0,0,255 2 16,750 0,109

The record in .BED file doesnot contain information like "frac", "conf_low", "conf_hi", "cov". But it uses only one line instead of 4 lines. Notice that the transcript_id is kept and the FPKM value (23.5) is rounded to 24 in score field of .BED file.

Feel free to comment or ask questions.

GETINSERTSIZE: Extract insert size information of paired-end reads from Cufflinks output

(originally published in blogspot: link)

The new version of Cufflinks tries to estimate the mean and std value of paired-end read insert size automatically. Sometimes we need to extract such information to check our sequencing data. Here I wrote a simple script to search Cufflinks outputs.

The script is written in Python. The implementation is easy: search for key words (like "Estimated Mean", etc) in log files.

Usage: getinsertsize [cufflinks log file]

Note: you may specify different log files using filename wildcards.

Sample output:

File MapMass ReadLength Mean Std
cufflinksinvoke.sh.e2148442 34108402.45 85 150.35 19.18
cufflinksinvoke.sh.e2148443 31007287.59 85 155.19 17.45
cufflinksinvoke.sh.e2148444 37286038.02 85 123.87 25.60

Source code:

Estimating paired-end read insert length from SAM/BAM files

(originally published in blogspot: link)

I wrote a single Python script to estimate the paired-end read insert length (or fragment length) from read mapping information (i.e., SAM/BAM files). The algorithm is simple: check the PNEXT field in the SAM format, throw out pair-end reads whose pairs are too far away, and use them to estimate the mean and variance of the insert length.

This script is distributed in GitHub below.

Usage:

getinsertsize.py [ SAM file ]

or

samtools view [ BAM file ] | getinsertsize.py

Sample output:

Read length: mean 90.6697303194, STD=15.9446036414

Possible read length and their counts:

{108: 43070882, 76: 50882326}

Read span: mean 165.217445903, STD=32.8914834802

Note: If the SAM/BAM file size is too large, it is accurate enough to estimate based on a few reads (like 1 millioin). In this case, you can run the script as follows:

head -n 1000000 [ SAM file ] | getinsertsize.py

samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py

by Wei Li