Yiqun "Eddie" Cao

I am fascinated by the design and implementation of algorithms and by the art of building efficient, scalable, and user-friendly large-scale computing systems that have a big impact on people's lives. For my PhD, I work mainly in the areas of algorithms, cheminformatics, and bioinformatics. My research interests include graph matching, machine learning, information retrieval, similarity measures for chemical compounds, and ligand- and receptor-based virtual screening of chemical compound libraries. I worked on distributed and parallel computing systems for my master's degree. And, did I mention I enjoy coding for these problems?

Below is a list of things I have worked on in reverse chronological order.

EI: Ultra fast similarity search

You type a search term, hit enter, and within a few hundred milliseconds Google has picked millions of relevant web pages from billions of candidates. Performing a structure search in a chemical compound database is much less fun. Even with PubChem, the best-known and best-funded small molecule database, it still takes half a minute to complete one structure similarity search. My research aimed to close this gap, improve the usefulness of compound structure databases, and make it possible to cluster millions of compound structures within manageable time constraints. The result was EI, a general method for accelerating similarity search and clustering in object databases. When applying EI to structure search and clustering of large compound libraries with tens of millions of entries, I observed large speedups: in one test, the structure similarity search time dropped from over 90 seconds to less than 0.5 seconds. This method has the potential to change how people view and use compound structure databases. The EI website was built with this technique to provide fast search over the whole PubChem Compounds database.

You may refer to my EI paper for more information.

Automated Docking

With the constant growth of screening compound libraries and the soaring cost of screening all these compounds, using computational methods to perform virtual screening as a pre-screening step has become more and more important. Molecular docking is an effective technique for virtual screening and has been used successfully to discover novel ligands against protein targets with known structures. However, with many parameters to choose, file formats to manipulate, and decisions to make, it remains a manual process that cannot be run on a large scale. I was involved in a feasibility study of automated docking, which built and tested such a system with a web-based UI. My responsibilities included building an automated benchmark system, integrating a more refined (but slower) docking system into UCSF DOCK, studying multiple consensus scoring schemes, and prototyping an automated system that performs docking studies using all eligible entries in PDB.

You may refer to the DockBlaster paper for more information.

The most challenging part of this project for me was getting to know UCSF DOCK and, later, PLOP. Once I had that context and could work at full speed, my work was mainly about data collection, processing and reporting, automation of simulation runs, and the design and implementation of a database with a web frontend. The following are some of the tools and techniques involved:

Python: the main language used.
R: the data processing and reporting language. The actual R programs were generated by Python on the fly from templates.
LaTeX: used to generate the actual report in PDF format.
diff: used to patch configuration files when automating simulation runs.
Django: used to build the web frontend for the (nice) presentation of the reports for the automated runs.
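
The idea of generating R programs from templates can be sketched as follows. This is a minimal illustration using Python's standard string.Template, not the actual project code; the template text, the placeholder names, and the render_r_script helper are all hypothetical.

```python
from string import Template

# Hypothetical R report template; placeholder names are illustrative only.
R_TEMPLATE = Template("""\
scores <- read.csv("$csv_path")
pdf("$pdf_path")
hist(scores[["$column"]], main = "$title")
dev.off()
""")


def render_r_script(csv_path, pdf_path, column, title):
    """Fill in the template and return the R source as a string."""
    return R_TEMPLATE.substitute(
        csv_path=csv_path, pdf_path=pdf_path, column=column, title=title
    )


# Example: one generated script per automated run.
script = render_r_script("run42.csv", "run42.pdf", "energy", "Docking energies")
print(script)
```

The generated text could then be written to a file and passed to R in batch mode; that step is omitted here.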

Besides interesting observations on using consensus scoring in molecular docking, the work also resulted in several products, some of which are still being used.

ePLOP: a wrapper for PLOP to make it much easier to use. It's the point-and-shoot for running PLOP.
mpose: an automation script to run DOCK in single mode. Without it, running single-mode DOCK can be very tedious.
thicol: a prototype system for a product called "the hit collection". It performs automatic docking against the whole PDB using DockBlaster and generates reports ready to be viewed through a web-based frontend.

Maximum Common Substructure-based similarity

How to measure the similarity between a pair of compound structures is one of the fundamental questions in ligand-based virtual screening and cheminformatics. Traditional structure similarity measures are often too rigid or consider only global similarities between structures. In this research, I used the maximum common substructure (MCS)-based approach as a flexible alternative for measuring similarity and predicting bioactive compounds. Several contributions were made. First, a new backtracking algorithm for MCS was proposed and implemented. The ANSI C-based implementation is portable and very flexible. Bindings for several popular scripting languages are also provided. Second, the effectiveness of MCS-based similarity search was compared to a traditional method utilizing global structure similarity. Third, a general method was proposed to use any similarity measure, or even a combination of several similarity measures, with a Support Vector Machine to predict bioactivity. Finally, tests were performed to see whether combining MCS-based and traditional similarity measures would result in higher accuracy in predicting bioactivity.
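
To make the idea of feeding similarity measures to a classifier concrete, here is a small sketch. It rests on assumptions that are not taken from the paper: the MCS-based score is written as a Tanimoto-style coefficient over atom counts (one common formulation), and each compound is turned into a feature vector of its similarities to a fixed set of reference compounds, which is one way any similarity measure, or several combined, could be handed to a Support Vector Machine. The toy measures over fragment sets are placeholders, and the classifier itself is omitted.

```python
def mcs_tanimoto(size_a, size_b, size_mcs):
    """Tanimoto-style score from the atom counts of two structures and their MCS."""
    return size_mcs / (size_a + size_b - size_mcs)


def set_tanimoto(a, b):
    """Placeholder global measure: Tanimoto coefficient of two fragment sets."""
    return len(a & b) / len(a | b)


def overlap(a, b):
    """Placeholder local measure: shared fragments relative to the smaller set."""
    return len(a & b) / min(len(a), len(b))


def similarity_features(compound, references, measures):
    """For each measure, list the similarity of `compound` to every reference."""
    return [measure(compound, ref) for measure in measures for ref in references]


# Toy data: compounds represented as sets of fragment labels (hypothetical).
references = [frozenset("abc"), frozenset("cde")]
query = frozenset("abcd")

features = similarity_features(query, references, [set_tanimoto, overlap])
print(features)
print(mcs_tanimoto(10, 8, 6))
```

Vectors built this way for compounds with known activity could serve as training input to any standard SVM implementation.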

You may refer to my MCS paper for more information.

ChemmineR: Cheminformatic Tools in R

Computational methods for structure comparison and statistical analysis tools lay the groundwork for building powerful cheminformatic techniques and systems. ChemmineR combines both by building the first open-source compound mining framework for the popular statistical programming environment, R. The package provides functions for structural similarity searches, compound clustering, screening library management, online batch viewing of chemical structures, etc. Users familiar with the R environment can easily build sophisticated compound library analysis pipelines by taking advantage of this package and the extensive statistical and machine learning resources available in R.

A C++-based package, libdescriptor, has been built by rewriting many functions in ChemmineR using C++. libdescriptor is a faster alternative to ChemmineR that handles large compound libraries more efficiently. Bindings for several popular scripting languages have also been provided. I even used the binding for R to create an optional "plug-in" package for ChemmineR. When installed, this package, called ChemmineR Performance Pack, greatly improves the performance of ChemmineR.

You may refer to my ChemmineR paper for more information.

Our Next-Generation Compound Database

ChemMine NG is the next-generation ChemMine (see below). The new ChemMine is based on Django and has a clean modular design. A compound database module, a screening database module, a content management module, and a ChemMine Platform module are planned and under construction. Besides the overall architecture, I am responsible for the ChemMine Platform feature. The ChemMine Platform allows applications to be built by third parties and served and run from any web server. The ChemMine Platform plays the role of a terminal (i.e., keyboard and screen, or input and output), and also brings in the concept of pipes, so that more sophisticated functions can be assembled from existing applications.
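
The pipe concept can be illustrated with a short sketch. This is not the ChemMine Platform API; the pipe helper and the three toy "applications" are hypothetical and only show how the output of one application becomes the input of the next.

```python
from functools import reduce


def pipe(*apps):
    """Chain applications so each one consumes the previous one's output."""
    def run(data):
        return reduce(lambda value, app: app(value), apps, data)
    return run


# Three toy applications (hypothetical stand-ins for third-party apps).
def parse_ids(text):
    """Split raw user input into compound identifiers."""
    return text.split()


def deduplicate(ids):
    """Drop repeated identifiers while keeping their original order."""
    return list(dict.fromkeys(ids))


def to_report(ids):
    """Format the identifiers for display."""
    return ", ".join(ids)


pipeline = pipe(parse_ids, deduplicate, to_report)
print(pipeline("c1 c2 c1\nc3"))
```

In a real platform each application would run on its own web server, with the platform passing data between them; here plain function calls stand in for that transport.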

BAP DB: The Phenotype Screening Database

BAP DB is a web-based database for exploring the biological and molecular functions of genes based on available phenotype and screening data from mutant, transgenic and wildtype organisms. It allows the public to query, view and download a wide spectrum of heterogeneous screening data. The user-friendly screening upload feature allows heterogeneous screening data, such as quantitative assay data, annotation information, images, and movies, to be uploaded and associated with genes in the database.

From a technical point of view, BAP DB is a spin-off of the ChemMine project (see below), and borrowed its first codebase from the screening database of ChemMine. Over time, many BAP DB-specific features have been added.

ChemMine: an All-In-One Compound Database

ChemMine is a compound mining database that facilitates drug and agrochemical discovery and chemical genomics screens. The three major components of ChemMine are 1) a chemical compound database featuring over 6 million compounds and structural and annotation search facilities, 2) a cheminformatics workbench providing tools for analyzing the structure and properties of compounds, and 3) a screening database that works like a content management system for screening data. ChemMine is among the most important projects of our lab, and it has been the driving force and testbed for many of our new ideas. For instance, the need to speed up similarity search led us to develop the EI method. As an umbrella project, ChemMine also provides web components for smaller projects and allows them to be well integrated with other ChemMine data and services. For example, ChemMine provides a web-based visualization service for ChemmineR.

You may refer to the following original ChemMine paper for more (though outdated) information.

Girke T, Cheng LC, Raikhel N. ChemMine. A compound mining database for chemical genomics. Plant Physiol: 138, 573-577.

Distributed and Parallel Computing

For my master's degree at the National University of Singapore, I worked on distributed and parallel computing and its use in biomedical applications. I also designed a simple master-worker application framework called Abstraction Communication Layer (ACL). ACL allows parallel programs to be created without knowledge of the underlying hardware, which can be a computer cluster, an arbitrary set of PCs, or even just one machine.

Using MPICH2, I built parallel programs for biofluid simulation and speckle image processing. Communication latency and the use of non-blocking communication were the main focus of the study.

ACL was planned to have backends for both MPICH2 and UNIX sockets. Only the MPICH2 backend was implemented and included as part of the package.
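
The master-worker pattern with swappable backends can be sketched in a few lines. This is only an illustration of the idea in Python, not ACL's interface: the backend classes, the master function, and the use of threads instead of MPICH2 are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


class SerialBackend:
    """Runs every task on the calling machine, one after another."""

    def map(self, func, items):
        return [func(item) for item in items]


class ThreadBackend:
    """Hands tasks to a pool of worker threads (stand-in for a cluster backend)."""

    def __init__(self, workers=4):
        self.workers = workers

    def map(self, func, items):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(func, items))


def master(backend, func, items, combine=sum):
    """Distribute items to workers via the backend, then combine the results."""
    return combine(backend.map(func, items))


def square(x):
    return x * x


# The same program runs unchanged on either backend.
serial_total = master(SerialBackend(), square, range(10))
threaded_total = master(ThreadBackend(workers=3), square, range(10))
print(serial_total, threaded_total)
```

The program code only talks to the master function, so switching from one machine to many is a matter of passing a different backend object.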

My master's thesis includes more on these topics.

Anti-virus Research

I conducted research on a network-facilitated anti-virus system. The basic idea was similar to today's Microsoft SpyNet. The system tracks the propagation path of malicious programs to achieve effective detection and even automatic removal of unknown malware, and to reverse some of its actions.

You can find more about this research in the following publication (in Chinese):

Ying Li, Ben Qiu, Yiqun Cao, Jian Jiao, Xiuming Shan, Yong Ren. A New DNS for Network Virus Inspection and Location. Computer Engineering 31(19).
