Fast-Shapelets: A Scalable Algorithm for Discovering  Time Series Shapelets

Please note that our paper is completely self-contained, the propose of this webpage is for providing all code and data to readers.

Supporting Material
FAQ Section
There are some questions which we want to clarify:

Q: Is the Fast-Shaplets an approximate algorithm?
A: Yes. Because of shapelet discovery algorithms are slow, even the current state-of-the-art, which is much faster than the original algorithm by a factor of 10, cannot find shapelet on large dataset. Hence, an approximation algorithm for finding shapelets is neccessary. Our shapelets discovery algorithm is comparable to the exact algorithm, i.e. the state-of-the-art, in term of the accuracy. From these 32 datasets, our algorithm is better for 17 datasets and the state-of-the-art is better for 15 datasets. In average from these 32 datasets, our algorithm is 5% better in term of term of the classification accuracy. For more information, please refer to Section 5 in the paper and this spread sheet to show the classification accuracy on 32 datasets from UCR Time Series Archive.

Q: Do we test on 32 datasets or all 45 datasets from UCR Time Series Archive?
A: We tested on all 45 datasets but, unfortunately, the current state-of-the-art can finish running on only 32 datasets in a day without crash. To be fair, we compare the results (e.g., accuracy and running time) based on these 32 datasets. Moreover, if the dataset is larger, our algorithm tends to gain more speeding up. Please refer to Section 5.2 for more detail. We also did compare the result of our algorithm and one nearest neighbor classification in  Section 5.3.

Q: Can our algorithm support generalization versions of shapelets, such as logical shapelets?
A: Absolutely, yes.

Q: May I have the code for creating the whole shapelet tree, not just the first node as shown in the paper?
A: Sure. The code we provided on this webpage is for creating the whole shapelet tree and apply the classification model to unseen data.

Classification Results on UCR Time Series Archive
All datasets we used is summarized here.
 DatasetName Classes N Train Ntest Length 50Words 50 450 455 270 Adiac 37 390 391 176 Beef 5 30 30 470 CBF 3 30 900 128 ChlorineConcentration 3 467 3840 166 CinC_ECG_torso 4 40 1380 1639 Coffee 2 28 28 286 DiatomSizeReduction 4 16 306 345 ECG200 2 100 100 96 ECGFiveDays 2 23 861 136 FaceAll 14 560 1690 131 FaceFour 4 24 88 350 FacesUCR 14 200 2050 131 Fish 7 175 175 463 Gun-Point 2 50 150 150 Haptics 5 155 308 1092 InlineSkate 7 100 550 1882 ItalyPowerDemand 2 67 1029 24 Lighting2 2 60 61 637 Lighting7 7 70 73 319 MALLAT 8 55 2345 1024 MedicalImages 10 381 760 99 Motes 2 20 1252 84 OliveOil 4 30 30 570 OSULeaf 6 200 242 427 SonyAIBORobotSurface 2 20 601 70 SonyAIBORobotSurfaceII 2 27 953 65 StarLightCurves 3 1000 8236 1024 SwedishLeaf 15 500 625 128 Symbols 6 25 995 398 synthetic_control 6 300 300 60 Trace 4 100 100 275 Two_Patterns 4 1000 4000 128 TwoLeadECG 2 23 1139 82 wafer 2 1000 6,174 152 WordsSynonyms 25 267 638 270 yoga 2 300 3,000 426 Cricket (small) 2 9 98 308 Cricket_X 12 390 390 300 Cricket_Y 12 390 390 300 Cricket_Z 12 390 390 300 uWaveGestureLibrary_X 8 896 3582 315 uWaveGestureLibrary_Y 8 896 3582 315 uWaveGestureLibrary_Z 8 896 3582 315 NonInvECG_Thorax1 42 1800 1965 750 NonInvECG_Thorax2 42 1800 1965 750

Experimental Results
• Spreadsheet shows the accuracy comparison to the state-of-the-art (data for creating Figure 8 and Figure 9).
• Spreadsheet shows the scalability on the Starlight dataset (data for creating Figure 10).
• Spreadsheet shows the accuracy comparison to the Euclidean one-nearest-neighbor (data for creating Figure 11).
• Spreadsheet shows the trivial effect of parameters r and k (data for creating Figure 12).

PAMAP Dataset
Physical Activity Monitoring for Aging People (PAMAP) dataset is a time series dataset recorded when subjects perform different activities such as walking, running, jumping, etc. The dataset can be found at www.pamap.org. All outdoor activities from the original PAMAP dataset, shown below on the right, is used in Section 5 in the paper.

 *Figure from www.pamap.org.