||CS234 Computational model for Biomolecular Data|
|Oct 02 - Oct 13||Decide the project and prepare
|Oct 14 - Oct 20||Project: Motif discovery.
Task : Implement the "random projection" algorithm for motif finding described in Tompa and Buhler, Finding Motifs Using Random Projections, Proc. RECOMB, 67-74, 2001 (also described in the slides). Run the program and collect experimental data. How would you improve its performance?
My present progress: reading the paper (it has 37 pages!).
|Oct 21 - Oct 27||Reading Paper
Planted (l,d)-Motif Problem Definition: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences, each of length n and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions.
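To make the definition concrete, here is a minimal sketch that generates a synthetic planted (l,d)-motif instance for testing. The function names (`mutate`, `planted_instance`), the uniform random background, and the `seed` parameter are my own assumptions for illustration and are not taken from the paper.

```python
import random

ALPHABET = "ACGT"

def mutate(motif, d, rng):
    """Return a copy of motif with exactly d point substitutions."""
    chars = list(motif)
    # Choose d distinct positions and replace each with a different base.
    for pos in rng.sample(range(len(motif)), d):
        chars[pos] = rng.choice([b for b in ALPHABET if b != chars[pos]])
    return "".join(chars)

def planted_instance(t, n, l, d, seed=0):
    """Build t random sequences of length n, each hiding one variant of M."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    seqs, starts = [], []
    for _ in range(t):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        # Overwrite l consecutive positions with a mutated copy of the motif.
        background[start:start + l] = mutate(motif, d, rng)
        seqs.append("".join(background))
        starts.append(start)
    return motif, seqs, starts
```

The returned `starts` list records where each variant was planted, so a motif finder's output can be checked against the known answer.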
|Oct 28 - Nov 3||Reading Paper
The projection algorithm: performs a number of independent trials of a basic iterant. In each such trial, it chooses a random projection h and hashes each l-mer x in the input sequences to its bucket h(x). Any hash bucket with sufficiently many entries is explored as a source of the planted motif, using a series of refinement steps.
Viewing x as a point in an l-dimensional Hamming space, h(x) is the projection of x onto a k-dimensional subspace. If M is the unknown planted motif, we will call the bucket with hash value h(M) the planted bucket. The fundamental intuition underlying PROJECTION is that, if k<l-d, there is a good chance that a number of the t planted instances of M will hash together into the planted bucket.
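My understanding: The projection keeps only k of the l positions. Whenever none of the d mutated positions of a planted variant is among the k kept positions, the noise (the d substitutions) disappears from the hash value, so that variant lands in the same bucket as M and the consensus (M) becomes easier to find.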
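A small sketch of the hashing step may help here. It assumes a projection is represented simply as a tuple of kept positions; the names `project` and `hash_lmers` are hypothetical helpers of mine, not from the paper.

```python
from collections import defaultdict

def project(lmer, positions):
    """h(x): keep only the characters of the l-mer at the chosen positions."""
    return "".join(lmer[i] for i in positions)

def hash_lmers(seqs, l, positions):
    """Map each bucket key h(x) to the (sequence index, start) of its l-mers."""
    buckets = defaultdict(list)
    for si, seq in enumerate(seqs):
        for start in range(len(seq) - l + 1):
            key = project(seq[start:start + l], positions)
            buckets[key].append((si, start))
    return buckets

# Two variants of ACGTACGT, mutated at positions 2 and 5 respectively.
# The projection (0, 1, 3, 6) avoids both mutated positions, so both
# variants get the same hash value as the unmutated motif.
keys = {project(x, (0, 1, 3, 6)) for x in ("ACGTACGT", "ACTTACGT", "ACGTAAGT")}
```

In this toy case `keys` contains a single value, which is exactly the situation where the planted bucket collects several instances.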
|Nov 4 - Nov 10||
The Random Projection Algorithm
Given an (l,d)-planted motif problem with t sequences of n base pairs:
1. Pick a random set of k positions out of l positions.
2. Replace each of the t(n-l+1) l-mers in the data by the k-mer obtained by keeping only those k positions specified in step 1.
3. Any k-mer that occurs s or more times is used to construct a multiple alignment from the corresponding l-mers, and a simplified MEME algorithm is run using that alignment as a seed.
4. Repeat m times, each time with a different set of k positions.
Report the best result.
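The steps above can be sketched as follows. This is only a rough outline under my own assumptions: the MEME-style refinement of step 3 is replaced by a plain column-wise consensus of the bucket's l-mers, and "best" is taken to mean the candidate with the smallest total Hamming distance to its closest l-mer in every sequence. The function names and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def consensus(lmers):
    """Most common base in each column of the (ungapped) alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*lmers))

def score(candidate, seqs):
    """Sum over sequences of the distance to the closest l-mer (lower is better)."""
    l = len(candidate)
    return sum(
        min(hamming(candidate, s[i:i + l]) for i in range(len(s) - l + 1))
        for s in seqs
    )

def random_projection(seqs, l, k, s, m, seed=0):
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(m):                                   # step 4: m trials
        positions = sorted(rng.sample(range(l), k))      # step 1
        buckets = defaultdict(list)
        for seq in seqs:                                 # step 2
            for i in range(len(seq) - l + 1):
                lmer = seq[i:i + l]
                buckets["".join(lmer[p] for p in positions)].append(lmer)
        for lmers in buckets.values():                   # step 3 (simplified)
            if len(lmers) >= s:
                cand = consensus(lmers)
                cand_score = score(cand, seqs)
                if cand_score < best_score:
                    best, best_score = cand, cand_score
    return best, best_score
```

If no bucket ever reaches s entries, the function returns `(None, inf)`, which is a signal to lower s or raise m.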
Problem: How to choose parameters k,s,m ?
- Choosing k
To work, k must be <= l-d; otherwise every set of k positions includes at least one mutated position.
A smaller k increases the chance of finding s copies of the motif, but if k is too small we will start finding s copies of random sequences as well.
Solution: Pick the smallest k such that the expected number of copies of a random k-mer is no more than some E<1: $t(n-l+1)/4^k \le E$
- Choosing m
Probability that a motif l-mer will not include any mutated bases in the k-mer is given by p=C(l-d,k)/C(l,k)
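Probability that this applies to fewer than s out of the t motif instances is given by the binomial sum $B_{t,p}(s) = \sum_{i=0}^{s-1} C(t,i)\, p^i (1-p)^{t-i}$
To ensure that we succeed at least once out of m trials with probability at least q we must have: $1 - [B_{t,p}(s)]^m \ge q$, that is, $m \ge \log(1-q) / \log B_{t,p}(s)$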
- Choosing s
s must be large enough so that any subset of s motifs will be sufficient to find all the motifs
s must be large enough so that most random k-mers don't occur s or more times. The expected number of occurrences of a random k-mer is: $t(n-l+1)/4^k$
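The parameter rules above can be turned into a short calculation. This is a sketch under my own reading of the formulas: the function names are hypothetical, and the default values E=0.5 and q=0.95 are assumptions for illustration only.

```python
import math

def choose_k(t, n, l, d, E=0.5):
    """Smallest k with t(n-l+1)/4^k <= E; None if that k exceeds l-d."""
    total = t * (n - l + 1)
    k = 1
    while total / 4 ** k > E:
        k += 1
    return k if k <= l - d else None

def planted_prob(l, d, k):
    """p = C(l-d,k)/C(l,k): the k kept positions miss all d mutations."""
    return math.comb(l - d, k) / math.comb(l, k)

def fail_prob(t, s, p):
    """Binomial sum: fewer than s of the t instances hash to the planted bucket."""
    return sum(math.comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(s))

def choose_m(t, l, d, k, s, q=0.95):
    """Smallest m with 1 - fail_prob^m >= q."""
    b = fail_prob(t, s, planted_prob(l, d, k))
    if b >= 1.0:
        raise ValueError("planted bucket can never reach s entries (is k > l-d?)")
    if b == 0.0:
        return 1
    return math.ceil(math.log(1 - q) / math.log(b))
```

For example, `choose_k(20, 600, 15, 4, E=0.75)` picks k for 20 sequences of length 600, and `choose_m(20, 15, 4, k, s=4)` then gives the number of trials needed for that k.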
|Nov 11 - Nov 17||Design
System design - design algorithm, function and data structure
|Nov 18 - Nov 25||Programming
Coding and testing
|Nov 25 - Dec 2||Write report
Write up my own ideas and impressions of this project
Dec 4, 5:00 pm