Fast-Shapelets: A Scalable Algorithm for Discovering Time Series Shapelets
Please note that our paper is completely self-contained, the propose of
this webpage is for providing all code and data to readers.
Supporting
Material
FAQ Section
There are some questions which we want to clarify:
Q: Is the Fast-Shaplets an approximate algorithm?
A: Yes. Because
of shapelet discovery algorithms are slow, even the current
state-of-the-art, which is much faster than the original algorithm by a
factor of 10, cannot find shapelet on large dataset. Hence,
an approximation algorithm for finding shapelets is neccessary. Our
shapelets discovery algorithm is comparable to the exact
algorithm, i.e. the state-of-the-art, in term of the accuracy. From
these 32 datasets, our algorithm is better for 17 datasets and the
state-of-the-art is better for 15 datasets. In average from these 32 datasets, our algorithm
is 5% better in term of term of the classification accuracy. For more information, please refer to Section 5 in the paper
and this spread sheet to show the classification accuracy on 32
datasets from UCR Time Series Archive.
Q: Do we test on 32 datasets or all 45 datasets from UCR Time Series Archive?
A: We
tested on all 45 datasets but, unfortunately, the current
state-of-the-art can
finish running on only 32 datasets in a day without crash. To be
fair, we compare the results (e.g., accuracy
and running time) based on these 32 datasets. Moreover, if the
dataset
is larger, our algorithm tends to gain more speeding up. Please refer
to Section 5.2 for more detail. We also did compare the result of our
algorithm and one nearest neighbor classification in Section 5.3.
Q: Can our algorithm support generalization versions of shapelets, such as logical shapelets?
A: Absolutely, yes.
Q: May I have the code for creating the whole shapelet tree, not just the first node as shown in the paper?
A:
Sure. The code we provided on this webpage is for creating the whole
shapelet tree and apply the classification model to unseen data.
Classification Results on UCR Time Series Archive
All datasets we used is summarized here.
DatasetName | Classes | N Train | Ntest | Length |
50Words | 50 | 450 | 455 | 270 |
Adiac | 37 | 390 | 391 | 176 |
Beef | 5 | 30 | 30 | 470 |
CBF | 3 | 30 | 900 | 128 |
ChlorineConcentration | 3 | 467 | 3840 | 166 |
CinC_ECG_torso | 4 | 40 | 1380 | 1639 |
Coffee | 2 | 28 | 28 | 286 |
DiatomSizeReduction | 4 | 16 | 306 | 345 |
ECG200 | 2 | 100 | 100 | 96 |
ECGFiveDays | 2 | 23 | 861 | 136 |
FaceAll | 14 | 560 | 1690 | 131 |
FaceFour | 4 | 24 | 88 | 350 |
FacesUCR | 14 | 200 | 2050 | 131 |
Fish | 7 | 175 | 175 | 463 |
Gun-Point | 2 | 50 | 150 | 150 |
Haptics | 5 | 155 | 308 | 1092 |
InlineSkate | 7 | 100 | 550 | 1882 |
ItalyPowerDemand | 2 | 67 | 1029 | 24 |
Lighting2 | 2 | 60 | 61 | 637 |
Lighting7 | 7 | 70 | 73 | 319 |
MALLAT | 8 | 55 | 2345 | 1024 |
MedicalImages | 10 | 381 | 760 | 99 |
Motes | 2 | 20 | 1252 | 84 |
OliveOil | 4 | 30 | 30 | 570 |
OSULeaf | 6 | 200 | 242 | 427 |
SonyAIBORobotSurface | 2 | 20 | 601 | 70 |
SonyAIBORobotSurfaceII | 2 | 27 | 953 | 65 |
StarLightCurves | 3 | 1000 | 8236 | 1024 |
SwedishLeaf | 15 | 500 | 625 | 128 |
Symbols | 6 | 25 | 995 | 398 |
synthetic_control | 6 | 300 | 300 | 60 |
Trace | 4 | 100 | 100 | 275 |
Two_Patterns | 4 | 1000 | 4000 | 128 |
TwoLeadECG | 2 | 23 | 1139 | 82 |
wafer | 2 | 1000 | 6,174 | 152 |
WordsSynonyms | 25 | 267 | 638 | 270 |
yoga | 2 | 300 | 3,000 | 426 |
Cricket (small) | 2 | 9 | 98 | 308 |
Cricket_X | 12 | 390 | 390 | 300 |
Cricket_Y | 12 | 390 | 390 | 300 |
Cricket_Z | 12 | 390 | 390 | 300 |
uWaveGestureLibrary_X | 8 | 896 | 3582 | 315 |
uWaveGestureLibrary_Y | 8 | 896 | 3582 | 315 |
uWaveGestureLibrary_Z | 8 | 896 | 3582 | 315 |
NonInvECG_Thorax1 | 42 | 1800 | 1965 | 750 |
NonInvECG_Thorax2 | 42 | 1800 | 1965 | 750 |
Experimental Results
- Spreadsheet shows the accuracy comparison to the state-of-the-art (data for creating Figure 8 and Figure 9).
- Spreadsheet shows the scalability on the Starlight dataset (data for creating Figure 10).
- Spreadsheet shows the accuracy comparison to the Euclidean one-nearest-neighbor (data for creating Figure 11).
- Spreadsheet shows the trivial effect of parameters r and k (data for creating Figure 12).
PAMAP Dataset
Physical
Activity Monitoring for Aging People (PAMAP) dataset is a time series
dataset recorded when subjects perform different activities such as
walking, running, jumping, etc. The dataset can be found at www.pamap.org. All outdoor activities from the original PAMAP dataset, shown below on the right, is used in Section 5 in the paper.