|
|
|
CS234 Computational model for Biomolecular Data | |
|
||
Oct 02 - Oct 13 | Decide the project and prepare
|
|
Oct 14 - Oct 20 | Project: Motif discovery.
Task: Implement the "random projection" algorithm for motif finding described in Buhler and Tompa, Finding Motifs Using Random Projections, Proc. RECOMB, 67-74, 2001 (also described in the slides). Run the program and collect experimental data. How would you improve its performance? My present progress: reading the paper - it has 37 pages!
|
|
Oct 21 - Oct 27 | Reading Paper
Planted (l,d)-Motif Problem
Definition: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions.
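To make the definition concrete, here is a minimal sketch of how a test instance of the planted (l,d)-motif problem could be generated. The function names (mutate, plant_motif), the uniform random background and the seed parameter are my own assumptions for illustration; they are not taken from the paper.

```python
import random

ALPHABET = "ACGT"

def mutate(motif, d, rng):
    """Return a copy of motif with exactly d point substitutions."""
    variant = list(motif)
    # Pick d distinct positions and force each one to a different base.
    for pos in rng.sample(range(len(motif)), d):
        variant[pos] = rng.choice([b for b in ALPHABET if b != variant[pos]])
    return "".join(variant)

def plant_motif(t, n, l, d, seed=0):
    """Build t random sequences of length n, each holding one planted variant.

    Returns the hidden motif M, the sequences, and the planted offsets.
    The background is assumed to be uniform random over A, C, G, T.
    """
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, offsets = [], []
    for _ in range(t):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        # Overwrite l consecutive positions with a mutated copy of M.
        background[start:start + l] = mutate(motif, d, rng)
        sequences.append("".join(background))
        offsets.append(start)
    return motif, sequences, offsets
```

Knowing the motif and the offsets in advance makes it possible to check afterwards whether a motif finder recovered the planted M.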
|
|
Oct 28 - Nov 3 | Reading Paper
The projection algorithm: performs a number of independent trials of a basic iterant. In each such trial, it chooses a random projection h and hashes each l-mer x in the input sequences to its bucket h(x). Any hash bucket with sufficiently many entries is explored as a source of the planted motif, using a series of refinement steps. Viewing x as a point in an l-dimensional Hamming space, h(x) is the projection of x onto a k-dimensional subspace. If M is the unknown planted motif, we will call the bucket with hash value h(M) the planted bucket. The fundamental intuition underlying PROJECTION is that, if k < l-d, there is a good chance that a number of the t planted instances of M will hash together into the planted bucket.
My understanding: a projection that happens to skip the d mutated positions removes the noise, so the planted instances agree on the remaining k positions and the consensus (M) is easier to find.
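The hashing part of one trial can be sketched as follows. This is only my reading of the description above: the names random_projection, hash_lmers and enriched_buckets are hypothetical, the hash value is simply the string of the k kept characters, and the refinement steps are left out.

```python
import random
from collections import defaultdict

def random_projection(l, k, rng):
    """Choose k of the l positions at random; this set defines h."""
    return sorted(rng.sample(range(l), k))

def hash_lmers(sequences, l, positions):
    """Hash every l-mer x to its bucket h(x).

    h(x) keeps only the characters of x at the chosen positions.
    Each bucket stores (sequence index, offset) pairs.
    """
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - l + 1):
            lmer = seq[j:j + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((i, j))
    return buckets

def enriched_buckets(buckets, s):
    """Keep the buckets with at least s entries (candidates for refinement)."""
    return {key: hits for key, hits in buckets.items() if len(hits) >= s}
```

For example, with the l-mers ACGT and ACTT and the kept positions [0, 1, 3], both hash to the bucket ACT even though they differ at position 2.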
|
|
Nov 4 - Nov 10 |
Reading Paper
The Random Projection Algorithm
Given an (l,d)-planted motif problem with t sequences of n base pairs:
1. Pick a random set of k positions out of the l positions.
2. Replace each of the t(n-l+1) l-mers in the data by the k-mer obtained by keeping only the k positions chosen in step 1.
3. Any k-mer that occurs s or more times is used to construct a multiple alignment from the corresponding l-mers, and a simplified MEME algorithm is run using it as a seed.
4. Repeat m times with different sets of k positions. Report the best result.
Problem: How to choose the parameters k, s, m?
- Choosing k
To work, k must be <= l-d. A smaller k increases the chance of finding s copies of the motif, but if k is too small we will start finding s copies of random sequences as well.
Solution: Pick the smallest k such that the expected number of copies of a random k-mer is no more than E < 1: t(n-l+1)/4^k <= E
- Choosing m
The probability that a motif l-mer does not include any mutated bases in the k-mer is p = C(l-d,k)/C(l,k)
The probability that this applies to fewer than s out of the t motifs is given by the binomial sum: B = sum_{i=0}^{s-1} C(t,i) p^i (1-p)^(t-i)
To ensure that we succeed at least once out of m trials with probability at least q we must have: 1 - B^m >= q, i.e. m >= log(1-q)/log(B)
- Choosing s
s must be large enough so that any subset of s motifs will be sufficient to find all the motifs.
s must be large enough so that most random k-mers don't occur s or more times. The expected number of occurrences of a random k-mer is: t(n-l+1)/4^k
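The parameter choices can be checked numerically with a small sketch. It rests on my own assumptions: the expected-count bound is read as t(n-l+1)/4^k, "fewer than s out of t" is read as a binomial sum over i = 0..s-1, and the m trials are treated as independent. All function names and the default values of E and q are hypothetical.

```python
from math import ceil, comb, log

def choose_k(t, n, l, d, E=0.75):
    """Smallest k <= l-d with t(n-l+1)/4^k <= E, or None if there is none."""
    lmers = t * (n - l + 1)
    for k in range(1, l - d + 1):
        if lmers / 4 ** k <= E:
            return k
    return None

def hit_probability(l, d, k):
    """p = C(l-d,k)/C(l,k): the k kept positions avoid all d mutations."""
    return comb(l - d, k) / comb(l, k)

def miss_probability(t, p, s):
    """Binomial sum: fewer than s of the t instances land in the planted bucket."""
    return sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(s))

def trials_needed(t, p, s, q=0.95):
    """Smallest m with 1 - B^m >= q, where B = miss_probability(t, p, s)."""
    b = miss_probability(t, p, s)
    if b >= 1.0:
        raise ValueError("a trial can never succeed with these parameters")
    if b <= 0.0:
        return 1  # every trial succeeds
    return max(1, ceil(log(1 - q) / log(b)))
```

A typical use would be k = choose_k(t, n, l, d), then p = hit_probability(l, d, k), then m = trials_needed(t, p, s) for a chosen bucket threshold s.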
|
|
Nov 11 - Nov 17 | Design
System design - design algorithm, function and data structure
|
|
Nov 18 - Nov 25 | Programming
Coding and testing
|
|
Nov 25 - Dec 2 | Write report
Write up my own ideas and thoughts about this project
|
|
Dec 4, 5:00 pm |
Demo
Perfect ending!
|
|