MS/CS234 Project

Mapping Short Reads to a Reference Genome


02/24/12: Update on Synthesis

I completed a single mapping unit and synthesized it successfully with four systolic array cells/rows of the traditional SW feedback array. This means a four base pair read can be matched to a stream of any length. Unfortunately, when I tried to increase the read length to 101, the synthesis was not able to complete the Place and Route phase in a reasonable amount of time. Remember that the goal is to have many of these units in parallel, and even one is hard to synthesize due to the massive amount of code generated by ROCCC (The main VHDL file is 35,000 lines.). I will consider the feasibility and usefulness of shorter read lengths (in the 30s, perhaps), since some technologies yield reads that are much shorter than 100 base pairs.

Another significant roadblock is how to reprogram the mapping units so that they can map different short reads. Since ROCCC is in development, many features I would like to have are not available, such as the ability to nest Systems (Only Modules are nestable.) and the ability to use systolic array generation in different ways. The latter problem requires that one of the input strings (the read) is required to be declared as a const array in C. This would be an easy problem to get around by changing the VHDL later, if the VHDL code wasn't so huge and hard to read. This is my biggest problem right now: designing a wrapper (or modify the unit itself) that will be able to dynamically change the reads each mapping unit is assigned, which are currently hardcoded in the VHDL code.


02/10/12: Informal Problem Statement

This project's goal is to use ROCCC, a C-to-VHDL compiler, to enable easy usage of C source code describing the Smith-Waterman algorithm for configuring an FPGA that will be able to provide large speedups over current Smith-Waterman implementations that run on software. It will use many mapping units, which take advantage of ROCCC's systolic array optimization, in parallel. Each unit will correspond to a short read/query, and the reference genome will be streamed through all of the units. The short reads will be randomly generated (with artificial sequencing errors) from a reference genome obtained from yet undetermined sources on the internet.