•  Course  
CS234  Computational model for Biomolecular Data
  •  Project progress

           Oct  02 - Oct 13 Decide the project and prepare


           Oct  14 - Oct 20 Project: Motif discovery.

 Task: Implement the "random projection" algorithm for motif finding described in Buhler and Tompa, Finding Motifs Using Random Projections, Proc. RECOMB, 67-74, 2001 (also described in the slides). Run the program and collect experimental data. How would you improve its performance?

 My present progress: reading the paper (it has 37 pages!)


         Oct  21 - Oct 27 Reading Paper

  Planted (l,d)-Motif Problem Definition: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions.
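
To make the definition concrete, here is a minimal sketch of how a test instance could be generated. The function names (mutate, planted_instance, hamming) and the uniform random background are my own assumptions for illustration, not something the paper prescribes.

```python
import random

ALPHABET = "ACGT"

def mutate(motif, d, rng):
    """Return a copy of motif with exactly d point substitutions."""
    chars = list(motif)
    # Pick d distinct positions and force each one to a different base.
    for pos in rng.sample(range(len(motif)), d):
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)

def planted_instance(l, d, t, n, seed=0):
    """Build t random sequences of length n, each with one planted variant of M."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        background[start:start + l] = mutate(motif, d, rng)
        sequences.append("".join(background))
        positions.append(start)
    return motif, sequences, positions

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))
```

For example, planted_instance(15, 4, 20, 600) returns a motif of length 15, 20 sequences of length 600, and the start position of the planted variant in each sequence. Every planted variant is at Hamming distance exactly 4 from the motif.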


           Oct  28 - Nov  3 Reading Paper 

The projection algorithm performs a number of independent trials of a basic iterate. In each such trial, it chooses a random projection h and hashes each l-mer x in the input sequences to its bucket h(x). Any hash bucket with sufficiently many entries is explored as a source of the planted motif, using a series of refinement steps.

Viewing x as a point in an l-dimensional Hamming space, h(x) is the projection of x onto a k-dimensional subspace. If M is the unknown planted motif, we will call the bucket with hash value h(M) the planted bucket. The fundamental intuition underlying PROJECTION is that, if k<l-d, there is a good chance that a number of the t planted instances of M will hash together into the planted bucket.

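My understanding: a projection ignores l-k of the positions. When all d mutated positions of a variant are among the ignored ones, the noise disappears and the variant hashes to the same bucket as the consensus M, which makes M easier to find.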
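
A small sketch of this intuition follows. The function name project and the example strings are made up for illustration. The variant differs from the motif at positions 2 and 8 (d = 2), so a projection that avoids both positions puts the two strings in the same bucket, while a projection that touches either one separates them.

```python
def project(lmer, positions):
    """h(x): keep only the characters of the l-mer at the chosen positions."""
    return "".join(lmer[i] for i in positions)

motif   = "ACGTACGTAC"
variant = "ACCTACGTTC"   # same as motif except at positions 2 and 8

good = [0, 1, 4, 5, 9]   # avoids both mutated positions
bad  = [0, 2, 4, 6, 8]   # includes both mutated positions

print(project(motif, good), project(variant, good))   # ACACC ACACC
print(project(motif, bad),  project(variant, bad))    # AGAGA ACAGT
```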


        Nov   4 - Nov 10

Reading Paper 

The Random Projection Algorithm

  Given an (l,d)-planted motif problem with t sequences of n base pairs:

   1. Pick a random set of k positions out of l positions.

   2. Replace each of the t(n-l+1) l-mers in the data by the k-mer obtained by keeping only those k positions specified in step 1.

   3. Any k-mer that occurs s or more times is used to construct a multiple alignment from the corresponding l-mers, and a simplified MEME algorithm is run using that alignment as a seed.

   4. Repeat m times with a different set of k positions each time.

        Report the best result.
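
The steps above can be sketched roughly as follows. This is only a skeleton under stated assumptions: the MEME-style refinement of step 3 is replaced by a plain column-wise consensus of the bucket, and "best" is taken to mean the candidate with the smallest summed Hamming distance to its closest l-mer in each sequence. Both choices, and all function names, are my own simplifications and not the method from the paper.

```python
import random
from collections import Counter, defaultdict

def lmers(sequences, l):
    """Yield every length-l substring of every sequence."""
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            yield seq[i:i + l]

def consensus(strings):
    """Most common character in each column (stand-in for MEME refinement)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))

def total_distance(candidate, sequences):
    """Sum over sequences of the distance from candidate to its closest l-mer."""
    l = len(candidate)
    return sum(
        min(sum(a != b for a, b in zip(candidate, seq[i:i + l]))
            for i in range(len(seq) - l + 1))
        for seq in sequences
    )

def random_projection(sequences, l, k, s, m, seed=0):
    rng = random.Random(seed)
    best, best_score = None, None
    for _ in range(m):                                   # step 4: m trials
        positions = sorted(rng.sample(range(l), k))      # step 1: k of l positions
        buckets = defaultdict(list)
        for x in lmers(sequences, l):                    # step 2: hash by k-mer
            buckets["".join(x[i] for i in positions)].append(x)
        for members in buckets.values():                 # step 3: large buckets
            if len(members) >= s:
                cand = consensus(members)
                score = total_distance(cand, sequences)
                if best_score is None or score < best_score:
                    best, best_score = cand, score
    return best, best_score
```

A tiny hand-made check: with sequences ["TTACGTGG", "CCACGTAA", "GAACGTTC"], where ACGT is planted without mutations, random_projection(sequences, l=4, k=3, s=3, m=2) returns ("ACGT", 0).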

Problem: How to choose parameters k,s,m ?

 - Choosing k

       To work, k must be <= l-d; otherwise every choice of k positions includes at least one mutated position of each variant.

       A smaller k increases the chance of finding s copies of the motif, but if k is too small we will start finding s copies of random sequences as well.

   Solution: Pick the smallest k so that the expected number of copies of a random k-mer is no more than E<1:   t(n-l+1)/4^k <= E

 - Choosing m

       Probability that a motif l-mer will not include any mutated bases in the k-mer is given by p=C(l-d,k)/C(l,k)

       Probability that this applies to fewer than s out of t motifs is given by the binomial sum B(t,p,s) = sum over i=0..s-1 of C(t,i) p^i (1-p)^(t-i)

       To ensure that we succeed at least once out of m trials with probability at least q we must have: 1 - B(t,p,s)^m >= q, that is, m >= log(1-q)/log(B(t,p,s))

 - Choosing s

     s must be large enough so that any subset of s motifs will be sufficient to find all the motifs

     s must be large enough so that most random k-mers don't occur s or more times. The expected number of occurrences of a random k-mer is: t(n-l+1)/4^k
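
The parameter rules above can be turned into a small calculator. This is a sketch under my own naming: B is the probability that fewer than s of the t planted l-mers project without a mutated base, computed as the sum over i=0..s-1 of C(t,i) p^i (1-p)^(t-i), and m is the smallest integer with 1 - B^m >= q. The input values at the bottom are illustrative only.

```python
from math import ceil, comb, log

def expected_random_hits(t, n, l, k):
    """Expected occurrences of one fixed k-mer among all projected l-mers."""
    return t * (n - l + 1) / 4 ** k

def choose_k(l, d, t, n, E):
    """Smallest k with t(n-l+1)/4^k <= E, or None if that k exceeds l-d."""
    k = 1
    while expected_random_hits(t, n, l, k) > E:
        k += 1
    return k if k <= l - d else None

def clean_projection_prob(l, d, k):
    """p = C(l-d,k)/C(l,k): the k chosen positions miss all d mutations."""
    return comb(l - d, k) / comb(l, k)

def prob_fewer_than_s(t, p, s):
    """Binomial sum: fewer than s of the t planted l-mers project cleanly."""
    return sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(s))

def choose_m(l, d, k, t, s, q):
    """Smallest m such that 1 - B^m >= q."""
    b = prob_fewer_than_s(t, clean_projection_prob(l, d, k), s)
    if b <= 0.0:
        return 1                      # a single trial always succeeds
    if b >= 1.0:
        raise ValueError("a trial can never succeed with these parameters")
    return ceil(log(1 - q) / log(b))

# Illustrative values only.
l, d, t, n = 15, 4, 20, 600
k = choose_k(l, d, t, n, E=0.75)      # k is 7 for these values
m = choose_m(l, d, k, t, s=4, q=0.95)
print(k, m)
```

Raising s makes random buckets less likely to pass the threshold but increases B, so the number of trials m grows with s.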


           Nov 11 - Nov 17 Design 

System design - design  algorithm, function and data structure


           Nov 18 - Nov 25 Programming 

Coding and testing


           Nov 25 - Dec  2 Write report

Write up my own ideas and thoughts about this project



         Dec   4 , 5:00 pm


Perfect ending!