Fast Shapelets - Supporting Website

Fast-Shapelets: A Scalable Algorithm for Discovering Time Series Shapelets

Please note that our paper is completely self-contained, the propose of this webpage is for providing all code and data to readers.

Supporting Material

The colorful paper published in SDM 2013 can be downloaded here.
Slide to show all figures in higher resolution.
C++ source doe to run all experiments is here.
The current state-of-the-art's code can be found here [mirror].

FAQ Section
There are some questions which we want to clarify:

Q: Is the Fast-Shaplets an approximate algorithm?
A: Yes. Because of shapelet discovery algorithms are slow, even the current state-of-the-art, which is much faster than the original algorithm by a factor of 10, cannot find shapelet on large dataset. Hence, an approximation algorithm for finding shapelets is neccessary. Our shapelets discovery algorithm is comparable to the exact algorithm, i.e. the state-of-the-art, in term of the accuracy. From these 32 datasets, our algorithm is better for 17 datasets and the state-of-the-art is better for 15 datasets. In average from these 32 datasets, our algorithm is 5% better in term of term of the classification accuracy. For more information, please refer to Section 5 in the paper and this spread sheet to show the classification accuracy on 32 datasets from UCR Time Series Archive.

Q: Do we test on 32 datasets or all 45 datasets from UCR Time Series Archive?
A: We tested on all 45 datasets but, unfortunately, the current state-of-the-art can finish running on only 32 datasets in a day without crash. To be fair, we compare the results (e.g., accuracy and running time) based on these 32 datasets. Moreover, if the dataset is larger, our algorithm tends to gain more speeding up. Please refer to Section 5.2 for more detail. We also did compare the result of our algorithm and one nearest neighbor classification in Section 5.3.

Q: Can our algorithm support generalization versions of shapelets, such as logical shapelets?
A: Absolutely, yes.

Q: May I have the code for creating the whole shapelet tree, not just the first node as shown in the paper?
A: Sure. The code we provided on this webpage is for creating the whole shapelet tree and apply the classification model to unseen data.

Classification Results on UCR Time Series Archive
All datasets we used is summarized here.

DatasetName	Classes	N Train	Ntest	Length
50Words	50	450	455	270
Adiac	37	390	391	176
Beef	5	30	30	470
CBF	3	30	900	128
ChlorineConcentration	3	467	3840	166
CinC_ECG_torso	4	40	1380	1639
Coffee	2	28	28	286
DiatomSizeReduction	4	16	306	345
ECG200	2	100	100	96
ECGFiveDays	2	23	861	136
FaceAll	14	560	1690	131
FaceFour	4	24	88	350
FacesUCR	14	200	2050	131
Fish	7	175	175	463
Gun-Point	2	50	150	150
Haptics	5	155	308	1092
InlineSkate	7	100	550	1882
ItalyPowerDemand	2	67	1029	24
Lighting2	2	60	61	637
Lighting7	7	70	73	319
MALLAT	8	55	2345	1024
MedicalImages	10	381	760	99
Motes	2	20	1252	84
OliveOil	4	30	30	570
OSULeaf	6	200	242	427
SonyAIBORobotSurface	2	20	601	70
SonyAIBORobotSurfaceII	2	27	953	65
StarLightCurves	3	1000	8236	1024
SwedishLeaf	15	500	625	128
Symbols	6	25	995	398
synthetic_control	6	300	300	60
Trace	4	100	100	275
Two_Patterns	4	1000	4000	128
TwoLeadECG	2	23	1139	82
wafer	2	1000	6,174	152
WordsSynonyms	25	267	638	270
yoga	2	300	3,000	426
Cricket (small)	2	9	98	308
Cricket_X	12	390	390	300
Cricket_Y	12	390	390	300
Cricket_Z	12	390	390	300
uWaveGestureLibrary_X	8	896	3582	315
uWaveGestureLibrary_Y	8	896	3582	315
uWaveGestureLibrary_Z	8	896	3582	315
NonInvECG_Thorax1	42	1800	1965	750
NonInvECG_Thorax2	42	1800	1965	750

Experimental Results

Spreadsheet shows the accuracy comparison to the state-of-the-art (data for creating Figure 8 and Figure 9).
Spreadsheet shows the scalability on the Starlight dataset (data for creating Figure 10).
Spreadsheet shows the accuracy comparison to the Euclidean one-nearest-neighbor (data for creating Figure 11).
Spreadsheet shows the trivial effect of parameters r and k (data for creating Figure 12).

PAMAP Dataset
Physical Activity Monitoring for Aging People (PAMAP) dataset is a time series dataset recorded when subjects perform different activities such as walking, running, jumping, etc. The dataset can be found at www.pamap.org. All outdoor activities from the original PAMAP dataset, shown below on the right, is used in Section 5 in the paper.

*Figure from www.pamap.org.

The general information of PAMAP dataset can be found here (or from the original wegpage).
Description of the data collection protocal is provided here (or from the original wegpage).
An example of our shapelet tree created from our algorithm is here.