Li Wei's Research Page

My research has focused on the mining of large datasets. More specifically, I am working on time series data mining and visualization. My advisor is Dr. Eamonn Keogh. Here are some of my publications.

Time series are perhaps the most commonly encountered data type, touching almost every aspect of human life, including finance, medicine, meteorology and entertainment. Because of their massive size, high dimensionality, and (typically) large amount of noise, most classic machine learning and data mining algorithms do not work well on time series data. The emphasis of my research is on effective and efficient algorithms for discovering important patterns in time series data. The following are some projects I have been working on.

In many applications it is desirable to monitor a streaming time series for predefined patterns. In domains as diverse as the monitoring of space telemetry, patient intensive care data, and insect populations, where data streams in at a high rate and the number of predefined patterns is large, it may be impossible for the comparison algorithm to keep up. We propose a novel technique that exploits the commonality among the predefined patterns to allow monitoring at higher bandwidths, while maintaining a guarantee of no false dismissals. Our approach is based on the widely used envelope-based lower-bounding technique. Experiments in diverse domains demonstrate that our approach achieves tremendous improvements in performance in the offline case, and significant improvements in the fastest arrival rate of the data stream that can be processed while still guaranteeing no false dismissals.
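As a rough illustration of the envelope idea, the sketch below merges several patterns into one upper/lower envelope and uses the distance from an incoming window to that envelope as a lower bound on its Euclidean distance to every enclosed pattern. It is a minimal sketch under assumptions, not the implementation from this work: the function names, the choice of Euclidean distance, and the fixed threshold are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_envelope(patterns):
    """Pointwise upper and lower bounds that enclose every pattern."""
    upper = [max(values) for values in zip(*patterns)]
    lower = [min(values) for values in zip(*patterns)]
    return upper, lower


def envelope_lower_bound(candidate, upper, lower):
    """Distance from the candidate to the envelope.

    Each pattern lies inside the envelope, so this value never exceeds
    the Euclidean distance from the candidate to any single pattern.
    """
    total = 0.0
    for c, u, l in zip(candidate, upper, lower):
        if c > u:
            total += (c - u) ** 2
        elif c < l:
            total += (l - c) ** 2
    return math.sqrt(total)


def matching_patterns(candidate, patterns, threshold):
    """Indices of patterns within `threshold` of the candidate window."""
    upper, lower = build_envelope(patterns)
    if envelope_lower_bound(candidate, upper, lower) > threshold:
        # No enclosed pattern can match, so all are pruned with one test.
        return []
    return [i for i, p in enumerate(patterns)
            if euclidean(candidate, p) <= threshold]
```

Because the pruning test only discards a window when the lower bound already exceeds the threshold, no true match is ever dismissed; the saving comes from skipping the per-pattern comparisons for most windows.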

The problem of time series classification has attracted great interest in the last decade. However, current research assumes the existence of large amounts of labeled training data. In reality, such data may be very difficult or expensive to obtain. As in many other domains, however, copious amounts of unlabeled data are often available. In this work we propose a semi-supervised technique for building time series classifiers. While such algorithms are well known in text domains, we show that special considerations must be made to make them both efficient and effective for the time series domain. We evaluate our work with a comprehensive set of experiments on diverse data sources including electrocardiograms, handwritten documents, manufacturing, and video datasets. The experimental results demonstrate that our approach requires only a handful of labeled examples to construct accurate classifiers.
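To make the semi-supervised setting concrete, here is a generic self-training loop with a nearest-neighbor rule: in each round, the unlabeled series closest to any labeled series inherits that neighbor's label and joins the labeled set. This is only a sketch of the general idea and not necessarily the technique proposed in this work; the function names, the Euclidean distance, and the fixed number of rounds are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def self_train(labeled, unlabeled, rounds):
    """Grow a labeled set from a small seed (hypothetical helper).

    labeled:   list of (series, label) pairs
    unlabeled: list of series
    rounds:    maximum number of unlabeled series to absorb
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # Find the unlabeled series nearest to any labeled series.
        _, index, label = min(
            ((euclidean(u, s), i, lab)
             for i, u in enumerate(pool)
             for s, lab in labeled),
            key=lambda item: item[0],
        )
        labeled.append((pool.pop(index), label))
    return labeled


def classify(series, labeled):
    """One-nearest-neighbor prediction against the labeled set."""
    return min(labeled, key=lambda pair: euclidean(series, pair[0]))[1]
```

In practice the number of rounds acts as a stopping criterion; absorbing too many unlabeled series risks propagating early mistakes.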

The matching of two-dimensional shapes is an important problem with applications in domains as diverse as biometrics, industry, medicine and anthropology. The distance measure used must be invariant to many distortions, including scale, offset, noise, and partial occlusion. Among these distortions, rotation seems to be uniquely difficult to handle. Current approaches typically try to achieve rotation invariance either in the representation of the data, at the expense of discrimination ability, or in the distance measure, at the expense of efficiency. In this work we show that we can take the slow but accurate approaches and dramatically speed them up. On real-world problems our technique can take current approaches and make them four orders of magnitude faster, without false dismissals. Moreover, our technique can be used with any of the existing shape representations and with all of the most popular distance measures.
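The "slow but accurate" route of handling rotation in the distance measure can be sketched as follows. The sketch assumes a shape has already been converted to a one-dimensional signature (for example, the distance from the centroid to each boundary point), so that rotating the shape corresponds to circularly shifting the signature; that representation, the function names, and the Euclidean distance are assumptions for illustration. It shows the brute-force baseline, not the speed-up introduced in this work.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rotation_invariant_distance(a, b):
    """Smallest distance between `a` and any circular shift of `b`.

    Tries every one of the n rotations, so the cost is n full distance
    computations per comparison.
    """
    best = math.inf
    for shift in range(len(b)):
        rotated = b[shift:] + b[:shift]
        best = min(best, euclidean(a, rotated))
    return best
```

The exhaustive loop over shifts is what makes this approach expensive for long signatures and large collections, which is the cost that a lower-bounding speed-up aims to avoid.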

Most visualization tools introduced in the literature are specialized for a particular task. In this work, we introduce a novel framework which allows visualization to take place in any GUI-based operating system. Our system works by replacing the standard file icons with automatically generated icons that reflect the contents of the files in a principled way. We call such icons Intelligent Icons. While there is little utility in examining an individual icon, examining groups of them provides a greater possibility of unexpected and serendipitous discoveries. The utility of Intelligent Icons can be further enhanced by arranging them on the screen in a way that reflects their similarities and differences. In addition, we show that our system is unique in also supporting fast and intuitive similarity search.

Most current anomaly (novelty/interestingness/surprisingness) detection algorithms are "custom made" for particular domains, which requires extensive effort by domain experts. In this work, we designed and implemented an online anomaly detection system that does not need to be customized for individual domains, yet performs with exceptionally high precision and recall. The system is based on the recently introduced idea of time series bitmaps. Experiments on datasets from domains as diverse as ECGs, Space Shuttle telemetry monitoring, video surveillance, and respiratory data show that our system is effective in finding anomalies in time series. You are welcome to try the Anomaly Detection Applet.
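One way to picture a bitmap-style comparison is sketched below: a window is discretized into a small alphabet, the frequencies of short symbol subsequences are collected into a fixed grid of cells, and two windows are compared by the squared difference between their grids. This is a simplified sketch under stated assumptions, not the system described here: the four-symbol alphabet, the breakpoints, the subsequence length of 2, the normalization by the largest count, and all function names are illustrative choices.

```python
import statistics
from collections import Counter
from itertools import product

ALPHABET = "abcd"
# Approximate breakpoints that split a standard normal into four
# roughly equiprobable regions (an assumption for this sketch).
BREAKPOINTS = (-0.67, 0.0, 0.67)


def discretize(series):
    """Z-normalize the series and map each value to a symbol."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    symbols = []
    for x in series:
        z = (x - mean) / std if std > 0 else 0.0
        symbols.append(ALPHABET[sum(z > bp for bp in BREAKPOINTS)])
    return "".join(symbols)


def bitmap(series, level=2):
    """Normalized frequency of every length-`level` symbol subsequence."""
    word = discretize(series)
    counts = Counter(word[i:i + level]
                     for i in range(len(word) - level + 1))
    peak = max(counts.values(), default=1)
    return {"".join(cell): counts["".join(cell)] / peak
            for cell in product(ALPHABET, repeat=level)}


def bitmap_distance(a, b):
    """Sum of squared differences between corresponding cells."""
    return sum((a[cell] - b[cell]) ** 2 for cell in a)
```

An online detector along these lines could slide two adjacent windows over the stream and treat a large `bitmap_distance` between them as an anomaly score; the window lengths would be parameters to choose per application.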

I worked as a research assistant in the WebDB and P2P Computing Lab at Fudan University from 1999 to 2003. At that time, my interest was data mining, more specifically, small pattern discovery in large databases. My advisor was Prof. Aoying Zhou.