Biomolecular Project

Mo Cao

email: mcao002@ucr.edu

Date: January - March, 2011

General Info.:

This project will implement the maximal unique match between two chromosomes. Chromosomes are represented as long strings. This program usses suffix tree data structure and linear suffix tree match algorithm to achieve the match. The high level idea of this program is to build a suffix tree for one known chromosome, then find the maximal unique match between another unknown chromosome and the suffix tree. Suffix tree data structure has been proved that it efficiently solves matching problems between long strings. The time complexity is O(m+n), where m and n are the length of two strings. The programming language used in this project is C++.

The First Part: building a suffix tree for the known chromosome

Briefly, suffix tree for an m-character string S is a rooted directed tree with exactly m leaves. Each edge is labeled with a nonempty substring of S. For any leaf i, the concatenation of the edge-label on the path from the root to the leaf i spells out the suffix of S that starts at position i. Thus, the suffix tree for string S represents all suffixes of S. The key point of a suffix tree is that each path form the root to a leaf represents one unique suffix of S. But by the definition of suffix tree, a path-label from the root down to the leaf may be some prefix of S. In order to avoid this situation, we put a special symbol "$" after the last symbol of S. Thus, each path from the root to a leaf exactly is labeled a suffix of S.

The time complexity of building a suffix tree is O(m), where m is the length of the known chromosome. Basically, there are three linear algorithms to build a suffix tree, which are Ukkonen algorithm, Weiner algorithm, and McCreight algorithm. All of them run in linear time. However, considering the space complexity, Ukkonen and McCreight algorithms save much more space than Weiner algorithm. The space complexity of Ukkonen and McCreight algorithm is O(m). I will use the Ukkonen algorithm in this project.

Details on Ukkonen algorithm

The high level idea of Ukkonen algorithm is to construct a suffix tree for each prefix S[1..i] of S, starting from S[1] and incrementing i by one until m. By the end, it will construct a complete suffix tree for S. Each suffix tree for S[1..i+1] is based on the suffix tree S[1..i], by using the extension rules to update the suffix tree S[1..i]. Each phase (i+1) is further divided into (i+1) extensions. In each extension j, it follows the three extensions. In order to achieve the time complexity O(m), there are three tricks in Ukkonen's algorithm. By using the edge-label compression, the algorithm achieves the space complexity which is O(m).

The Second Part: linear matching in suffix tree

coming...

Reference

Dan Gusfield, "Algorithms on strings, trees, and sequences".