Detecting time series discords in large disk resident datasets

The project resulted in a publication at ICDM 2007: Disk aware discord discovery: finding unusual time series in terabyte sized datasets

Description (from the abstract of the paper): The problem of finding unusual time series has recently attracted much attention, and several promising methods are now in the literature. However, virtually all proposed methods assume that the data reside in main memory. For many real-world problems this is not be the case. For example, in astronomy, multi-terabyte time series datasets are the norm. Most current algorithms faced with data which cannot fit in main memory resort to multiple scans of the disk/tape and are thus intractable. In this work we show how one particular definition of unusual time series, the time series discord, can be discovered with a disk aware algorithm. The proposed algorithm is exact and requires only two linear scans of the disk with a tiny buffer of main memory. Furthermore, it is very simple to implement. We use the algorithm to provide further evidence of the effectiveness of the discord definition in areas as diverse as astronomy, web query mining, video surveillance, etc., and show the efficiency of our method on datasets which are many orders of magnitude larger than anything else attempted in the literature.

Source code and executables:
cpp_discords.zip, java_discords.zip, matlab_discords.zip
Sample data files:

  Improving the instability of Support Vector Clustering in the case of noisy non-linear manifolds


The project resulted in a publication at ICDM 2007: Locally constrained Support Vector Clustering

Description (from the abstract of the paper): Support vector clustering transforms the data into a high dimensional feature space, where a decision function is computed. In the original space, the function outlines the boundaries of higher density regions, naturally splitting the data into individual clusters. The method, however, though theoretically sound, has certain drawbacks which make it not so appealing to the practitioner. Namely, it is unstable in the presence of outliers and it is hard to control the number of clusters that it identifies. Parametrizing the algorithm incorrectly in noisy settings, can either disguise some objectively present clusters in the data, or can identify a large number of small and nonintuitive clusters.
Here, we explore the properties of the data in small regions building a mixture of factor analyzers. The obtained information is used to regularize the complexity of the outlined cluster boundaries, by assigning suitable weighting to each example. The approach is demonstrated to be less susceptible to noise and to outline better interpretable clusters than support vector clustering alone.

Source code for the algorithm:

  Detecting time series motifs under uniform scaling

The project resulted in a publication at KDD 2007: Detecting time series motifs under uniform scaling

Description (from the abstract of the paper): Time series motifs are approximately repeated patterns found within the data. Such motifs have utility for many data mining algorithms, including rule-discovery, novelty-detection, summarization and clustering. Since the formalization of the problem and the introduction of efficient linear time algorithms, motif discovery has been successfully applied to many domains, including medicine, motion capture, robotics and meteorology.
In this work we show that most previous applications of time series motifs have been severely limited by the definition’s brittleness to even slight changes of uniform scaling, the speed at which the patterns develop. We introduce a new algorithm that allows discovery of time series motifs with invariance to uniform scaling, and show that it produces objectively superior results in several important domains. Apart from being more general than all other motif discovery algorithms, a further contribution of our work is that it is simpler than previous approaches, in particular we have drastically reduced the number of parameters that need to be specified.

  Speeding up nearest-neighbor search for the rotation invarinat Euclidean distance


The project resulted in a publication at SDM 2007: Fast best-match shape searching in rotation invariant metric spaces

Description (from the abstract of the paper): Object recognition and content-based image retrieval systems rely heavily on the accurate and efficient identification of shapes. A fundamental requirement in the shape analysis process is that shape similarities should be computed invariantly to basic geometric transformations, e.g. scaling, shifting, and most importantly, rotations. And while scale and shift invariance are easily achievable through a suitable shape representation, rotation invariance is much harder to deal with.
In this work we explore the metric properties of the rotation invariant distance measures and propose an algorithm for fast similarity search in the shape space. The algorithm can be utilized in a number of important data mining tasks such as shape clustering and classification, or for discovering of motifs and discords in image collections. The technique is demonstrated to introduce a dramatic speed-up over the current approaches, and is guaranteed to introduce no false dismissals.

  Clustering of shapes by reconstruction of the embeddded manifold

The project resulted in a publication at ICDM 2006: Manifold clustering of shapes

Description (from the abstract of the paper): Shape clustering can significantly facilitate the automatic labeling of objects present in image collections. For example, it could outline the existing groups of pathological cells in a bank of cyto-images; the groups of species on photographs collected from certain aerials; or the groups of objects observed on surveillance scenes from an office building.
Here we demonstrate that a nonlinear projection algorithm such as Isomap can attract together shapes of similar objects, suggesting the existence of isometry between the shape space and a low dimensional nonlinear embedding. Whenever there is a relatively small amount of noise in the data, the projection forms compact, convex clusters that can easily be learned by a subsequent partitioning scheme. We further propose a modification of the Isomap projection based on the concept of degree-bounded minimum spanning trees. The proposed approach is demonstrated to move apart bridged clusters and to alleviate the effect of noise in the data.

  Decreasing the bias and the variance of nearest neighbor forecasts

I worked on this problem as part of my internship with Yahoo! Research (Summer 2005) under the guidance of Dr. Dennis DeCoste. We were primarily interested in whether nearest neighbor methods can accurately forecast the number of impressions per webpage for different horizons - 1 month, 2 months or 3 months ahead. The project resulted in a publication at ECML 2006: Ensembles of nearest neighbor forecasts

Description (from the abstract of the paper): Nearest neighbor forecasting models are attractive with their simplicity and the ability to predict complex nonlinear behavior. They rely on the assumption that observations similar to the target one are also likely to have similar outcomes. A common practice in nearest neighbor model selection is to compute the globally optimal number of neighbors on a validation set, which is later applied for all incoming queries. For certain queries, however, this number may be suboptimal and forecasts that deviate a lot from the true realization could be produced.
To address the problem we propose an alternative approach of training ensembles of nearest neighbor predictors that determine the best number of neighbors for individual queries. We demonstrate that the forecasts of the ensembles improve significantly on the globally optimal single predictors.

  Adapting dot plots for real value time series data

The project resulted in a publication at ICTAI 2005: Dot plots for time series analysis

Description (from the abstract of the paper): Since their introduction in the seventies by Gibbs and McIntyre, dot plots have proved to be a powerful and intuitive technique for visual sequence analysis and mining. Their main domain of application is the field of bioinformatics where they are frequently used by researchers in order to elucidate genomic sequence similarities and alignment. However, this useful technique has remained comparatively constrained to domains where the data has an inherent discrete structure (i.e., text).
In this paper we demonstrate how dot plots can be used for the analysis and mining of real-valued time series. We design a tool that creates highly descriptive dot plots which allow one to easily detect similarities, anomalies, reverse similarities, and periodicities as well as changes in the frequencies of repetitions. As the underlying algorithm scales well with the input size, we also show the feasibility of the plots for on-line data monitoring.

  Click Through Protection - improving the quality of Yahoo!'s filtering system


I worked on this problem as part of my internship with the Data Mining and Research (DMR) group in Yahoo! (Summer-Fall 2006) under the guidance of Dr. Nicolas Mayoraz. The focus of our study were multiple clicks coming from the same user (groups of users) within small intervals of time. We demonstrated that the effectiveness of the filtering system with respect to such clicks can be improved with fast decision rules learned from historical click patterns, as well as from conversion feedback provided by different advertisers.

A reference for my work on this project can be found here:

  Categorization of mail sending IPs to enhance the spam detection system

I worked on this problem as part of my internship with the Data Mining and Research (DMR) group in Yahoo! (Summer 2007) under the guidance of Dr. Nicolas Mayoraz. The purpose of the project was to demonstrate whether traffic patterns alone (without additional mail content information) can discriminate between different categories of mail senders. We showed that density based clustering can capture regions in the data space with strikingly similar mail sending patterns. We further tested a number of semi-supervised, as well as multiclass supervised approaches, and demonstrated that traffic information can be used to efficiently and reliably categorize mail senders even in cases where content based mail classification fails.