


CS234 Computational model for Biomolecular Data  


Oct 02 - Oct 13  Decide on the project and prepare


Oct 14 - Oct 20  Project: Motif discovery
Task: Implement the "random projection" algorithm for motif finding described in Buhler and Tompa, Finding Motifs Using Random Projections, Proc. RECOMB, 67-74, 2001 (also described in the slides). Run the program and collect experimental data. How would you improve its performance?
My current progress: reading the paper (it has 37 pages!).


Oct 21 - Oct 27  Reading Paper
Planted (l,d)-Motif Problem. Definition: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions.
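To make the definition concrete, here is a minimal sketch of a generator for synthetic planted (l,d) instances. The function names (`mutate`, `planted_instance`), the uniform random background, and the seed handling are my own assumptions for illustration; they are not taken from the paper.

```python
import random

ALPHABET = "ACGT"

def mutate(motif, d, rng):
    """Return a copy of motif with exactly d point substitutions."""
    positions = rng.sample(range(len(motif)), d)  # d distinct positions
    variant = list(motif)
    for pos in positions:
        # Choose a base different from the current one so the change is real.
        variant[pos] = rng.choice([b for b in ALPHABET if b != variant[pos]])
    return "".join(variant)

def planted_instance(t, n, l, d, seed=0):
    """Build t random sequences of length n, each with one planted variant of M.

    Assumption: background bases are uniform and independent.
    Returns (motif, sequences, start_positions).
    """
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, starts = [], []
    for _ in range(t):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        background[start:start + l] = mutate(motif, d, rng)
        sequences.append("".join(background))
        starts.append(start)
    return motif, sequences, starts
```

For example, `planted_instance(t=20, n=600, l=15, d=4)` produces an instance with example parameter values; the returned start positions are only there so that a motif finder can be checked against the truth.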


Oct 28 - Nov 3  Reading Paper
The projection algorithm: performs a number of independent trials of a basic iterant. In each such trial, it chooses a random projection h and hashes each l-mer x in the input sequences to its bucket h(x). Any hash bucket with sufficiently many entries is explored as a source of the planted motif, using a series of refinement steps. Viewing x as a point in an l-dimensional Hamming space, h(x) is the projection of x onto a k-dimensional subspace. If M is the unknown planted motif, we will call the bucket with hash value h(M) the planted bucket. The fundamental intuition underlying PROJECTION is that, if k < l-d, there is a good chance that a number of the t planted instances of M will hash together into the planted bucket.
My understanding: the projection ignores l-k of the l positions. Whenever all d mutated positions of a planted instance fall among the ignored ones, the noise disappears and that instance hashes to the same bucket as M, so the consensus M becomes easier to find.
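The hashing step of a single trial can be sketched as follows. This is only my own illustration of the idea: the names `project`, `hash_lmers`, and `heavy_buckets` are hypothetical, the threshold `s` is passed in by the caller, and the refinement of the heavy buckets is left out.

```python
import random
from collections import defaultdict

def project(lmer, positions):
    """h(x): keep only the characters of the l-mer at the chosen positions."""
    return "".join(lmer[i] for i in positions)

def hash_lmers(sequences, l, k, rng):
    """One trial: choose k of the l positions at random and bucket every l-mer.

    Returns the chosen positions and a dict mapping each k-mer key to the
    list of (sequence index, start offset) pairs that hash to it.
    """
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            key = project(seq[start:start + l], positions)
            buckets[key].append((seq_index, start))
    return positions, buckets

def heavy_buckets(buckets, s):
    """Keep only buckets with at least s entries (candidates for refinement)."""
    return {key: hits for key, hits in buckets.items() if len(hits) >= s}
```

Storing (sequence index, offset) pairs instead of the l-mers themselves keeps the buckets small and lets the refinement step recover the original l-mers later.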


Nov 4 - Nov 10  Reading Paper
The Random Projection Algorithm
Given an (l,d) planted motif problem with t sequences of n base pairs:
1. Pick a random set of k positions out of the l positions.
2. Replace each of the t(n-l+1) l-mers in the data by the k-mer obtained by keeping only those k positions specified in step 1.
3. Any k-mer that occurs s or more times is used to construct a multiple alignment from the corresponding l-mers, and a simplified MEME algorithm is run using it as a seed.
4. Repeat m times with a different set of k positions. Report the best result.

Problem: How to choose the parameters k, s, m?

Choosing k
To work, k must be <= l-d. A smaller k increases the chance of finding s copies of the motif, but if k is too small we will start finding s copies of random sequences as well.
Solution: Pick the smallest k so that the expected number of copies of a random k-mer is no more than E < 1:
  t(n-l+1)/4^{k} <= E

Choosing m
The probability that the k chosen positions of a motif l-mer do not include any of its mutated bases is given by
  p = C(l-d,k)/C(l,k)
The probability that this applies to fewer than s out of the t motifs is given by the binomial sum
  B = sum_{i=0}^{s-1} C(t,i) p^{i} (1-p)^{t-i}
To ensure that we succeed at least once out of m trials with probability at least q, we must have 1 - B^{m} >= q, that is:
  m >= log(1-q)/log(B)

Choosing s
s must be large enough so that any subset of s motifs will be sufficient to find all the motifs.
s must be large enough so that most random k-mers don't occur s or more times. The expected number of occurrences of a random k-mer is:
  t(n-l+1)/4^{k}
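The rules for k and m can be turned into a small calculation. The sketch below is my own reading of those rules, not code from the paper; the function names are hypothetical, and the values of E and q are inputs that the user has to choose.

```python
from math import ceil, comb, log

def choose_k(t, n, l, d, E):
    """Smallest k <= l-d with t(n-l+1)/4^k <= E."""
    num_lmers = t * (n - l + 1)
    for k in range(1, l - d + 1):
        if num_lmers / 4 ** k <= E:
            return k
    raise ValueError("no k <= l-d satisfies the bound for this E")

def prob_clean(l, d, k):
    """p: probability that the k chosen positions miss all d mutated bases."""
    return comb(l - d, k) / comb(l, k)

def prob_fewer_than_s(t, p, s):
    """B: probability that fewer than s of the t instances hash cleanly."""
    return sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(s))

def choose_m(t, l, d, k, s, q):
    """Smallest m with 1 - B^m >= q, i.e. m >= log(1-q)/log(B)."""
    B = prob_fewer_than_s(t, prob_clean(l, d, k), s)
    if B >= 1.0:
        raise ValueError("a trial can never succeed with these parameters")
    if B == 0.0:
        return 1  # every trial succeeds
    return ceil(log(1 - q) / log(B))
```

As an example with assumed values t=20, n=600, l=15, d=4 and E=0.75, `choose_k` returns 7, since 20*586/4^7 is about 0.72 while 20*586/4^6 is about 2.86. Calling `choose_m(20, 15, 4, 7, 4, 0.95)` then gives the number of trials for s=4 and q=0.95.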


Nov 11 - Nov 17  Design
System design: design the algorithm, functions, and data structures


Nov 18 - Nov 25  Programming
Coding and testing


Nov 25 - Dec 2  Write report
Write up my own ideas and impressions of this project


Dec 4, 5:00 pm
Demo
Perfect ending!

