Fast-Shapelets: A Scalable Algorithm for Discovering  Time Series Shapelets



Please note that our paper is completely self-contained, the propose of this webpage is for providing all code and data to readers.

Supporting Material
FAQ Section
There are some questions which we want to clarify:

Q: Is the Fast-Shaplets an approximate algorithm?
A: Yes. Because of shapelet discovery algorithms are slow, even the current state-of-the-art, which is much faster than the original algorithm by a factor of 10, cannot find shapelet on large dataset. Hence, an approximation algorithm for finding shapelets is neccessary. Our shapelets discovery algorithm is comparable to the exact algorithm, i.e. the state-of-the-art, in term of the accuracy. From these 32 datasets, our algorithm is better for 17 datasets and the state-of-the-art is better for 15 datasets. In average from these 32 datasets, our algorithm is 5% better in term of term of the classification accuracy. For more information, please refer to Section 5 in the paper and this spread sheet to show the classification accuracy on 32 datasets from UCR Time Series Archive.

Q: Do we test on 32 datasets or all 45 datasets from UCR Time Series Archive?
A: We tested on all 45 datasets but, unfortunately, the current state-of-the-art can finish running on only 32 datasets in a day without crash. To be fair, we compare the results (e.g., accuracy and running time) based on these 32 datasets. Moreover, if the dataset is larger, our algorithm tends to gain more speeding up. Please refer to Section 5.2 for more detail. We also did compare the result of our algorithm and one nearest neighbor classification in  Section 5.3.

Q: Can our algorithm support generalization versions of shapelets, such as logical shapelets?
A: Absolutely, yes.

Q: May I have the code for creating the whole shapelet tree, not just the first node as shown in the paper?
A: Sure. The code we provided on this webpage is for creating the whole shapelet tree and apply the classification model to unseen data.


Classification Results on UCR Time Series Archive
All datasets we used is summarized here. 
DatasetNameClassesN TrainNtestLength
50Words50450455270
Adiac37390391176
Beef53030470
CBF330900128
ChlorineConcentration34673840166
CinC_ECG_torso44013801639
Coffee22828286
DiatomSizeReduction416306345
ECG200210010096
ECGFiveDays223861136
FaceAll145601690131
FaceFour42488350
FacesUCR142002050131
Fish7175175463
Gun-Point250150150
Haptics51553081092
InlineSkate71005501882
ItalyPowerDemand267102924
Lighting226061637
Lighting777073319
MALLAT85523451024
MedicalImages1038176099
Motes220125284
OliveOil43030570
OSULeaf6200242427
SonyAIBORobotSurface22060170
SonyAIBORobotSurfaceII22795365
StarLightCurves3100082361024
SwedishLeaf15500625128
Symbols625995398
synthetic_control630030060
Trace4100100275
Two_Patterns410004000128
TwoLeadECG223113982
wafer210006,174152
WordsSynonyms25267638270
yoga23003,000426
Cricket (small)2998308
Cricket_X12390390300
Cricket_Y12390390300
Cricket_Z12390390300
uWaveGestureLibrary_X88963582315
uWaveGestureLibrary_Y88963582315
uWaveGestureLibrary_Z88963582315
NonInvECG_Thorax14218001965750
NonInvECG_Thorax24218001965750

Experimental Results


PAMAP Dataset
Physical Activity Monitoring for Aging People (PAMAP) dataset is a time series dataset recorded when subjects perform different activities such as walking, running, jumping, etc. The dataset can be found at www.pamap.org. All outdoor activities from the original PAMAP dataset, shown below on the right, is used in Section 5 in the paper.


*Figure from www.pamap.org.