Fast-Shapelets: A Scalable Algorithm for Discovering Time Series Shapelets

Please note that our paper is completely self-contained, the propose of this webpage is for providing all code and data to readers.

Supporting Material

- The colorful paper published in SDM 2013 can be downloaded here.

- Slide to show all figures in higher resolution.
- C++ source doe to run all experiments is here.
- The current state-of-the-art's code can be found here [mirror].

There are some questions which we want to clarify:

Q: Is the Fast-Shaplets an approximate algorithm?

A: Yes. Because of shapelet discovery algorithms are slow, even the current state-of-the-art, which is much faster than the original algorithm by a factor of 10, cannot find shapelet on large dataset. Hence, an approximation algorithm for finding shapelets is neccessary. Our shapelets discovery algorithm is comparable to the exact algorithm, i.e. the state-of-the-art, in term of the accuracy. From these 32 datasets, our algorithm is better for 17 datasets and the state-of-the-art is better for 15 datasets. In average from these 32 datasets, our algorithm is 5% better in term of term of the classification accuracy. For more information, please refer to Section 5 in the paper and this spread sheet to show the classification accuracy on 32 datasets from UCR Time Series Archive.

Q: Do we test on 32 datasets or all 45 datasets from UCR Time Series Archive?

A: We tested on all 45 datasets but, unfortunately, the current state-of-the-art can finish running on only 32 datasets in a day without crash. To be fair, we compare the results (e.g., accuracy and running time) based on these 32 datasets. Moreover, if the dataset is larger, our algorithm tends to gain more speeding up. Please refer to Section 5.2 for more detail. We also did compare the result of our algorithm and one nearest neighbor classification in Section 5.3.

Q: Can our algorithm support generalization versions of shapelets, such as logical shapelets?

A: Absolutely, yes.

Q: May I have the code for creating the whole shapelet tree, not just the first node as shown in the paper?

A: Sure. The code we provided on this webpage is for creating the whole shapelet tree and apply the classification model to unseen data.

Classification Results on UCR Time Series Archive

All datasets we used is summarized here.

DatasetName | Classes | N Train | Ntest | Length |

50Words | 50 | 450 | 455 | 270 |

Adiac | 37 | 390 | 391 | 176 |

Beef | 5 | 30 | 30 | 470 |

CBF | 3 | 30 | 900 | 128 |

ChlorineConcentration | 3 | 467 | 3840 | 166 |

CinC_ECG_torso | 4 | 40 | 1380 | 1639 |

Coffee | 2 | 28 | 28 | 286 |

DiatomSizeReduction | 4 | 16 | 306 | 345 |

ECG200 | 2 | 100 | 100 | 96 |

ECGFiveDays | 2 | 23 | 861 | 136 |

FaceAll | 14 | 560 | 1690 | 131 |

FaceFour | 4 | 24 | 88 | 350 |

FacesUCR | 14 | 200 | 2050 | 131 |

Fish | 7 | 175 | 175 | 463 |

Gun-Point | 2 | 50 | 150 | 150 |

Haptics | 5 | 155 | 308 | 1092 |

InlineSkate | 7 | 100 | 550 | 1882 |

ItalyPowerDemand | 2 | 67 | 1029 | 24 |

Lighting2 | 2 | 60 | 61 | 637 |

Lighting7 | 7 | 70 | 73 | 319 |

MALLAT | 8 | 55 | 2345 | 1024 |

MedicalImages | 10 | 381 | 760 | 99 |

Motes | 2 | 20 | 1252 | 84 |

OliveOil | 4 | 30 | 30 | 570 |

OSULeaf | 6 | 200 | 242 | 427 |

SonyAIBORobotSurface | 2 | 20 | 601 | 70 |

SonyAIBORobotSurfaceII | 2 | 27 | 953 | 65 |

StarLightCurves | 3 | 1000 | 8236 | 1024 |

SwedishLeaf | 15 | 500 | 625 | 128 |

Symbols | 6 | 25 | 995 | 398 |

synthetic_control | 6 | 300 | 300 | 60 |

Trace | 4 | 100 | 100 | 275 |

Two_Patterns | 4 | 1000 | 4000 | 128 |

TwoLeadECG | 2 | 23 | 1139 | 82 |

wafer | 2 | 1000 | 6,174 | 152 |

WordsSynonyms | 25 | 267 | 638 | 270 |

yoga | 2 | 300 | 3,000 | 426 |

Cricket (small) | 2 | 9 | 98 | 308 |

Cricket_X | 12 | 390 | 390 | 300 |

Cricket_Y | 12 | 390 | 390 | 300 |

Cricket_Z | 12 | 390 | 390 | 300 |

uWaveGestureLibrary_X | 8 | 896 | 3582 | 315 |

uWaveGestureLibrary_Y | 8 | 896 | 3582 | 315 |

uWaveGestureLibrary_Z | 8 | 896 | 3582 | 315 |

NonInvECG_Thorax1 | 42 | 1800 | 1965 | 750 |

NonInvECG_Thorax2 | 42 | 1800 | 1965 | 750 |

Experimental Results

- Spreadsheet shows the accuracy comparison to the state-of-the-art (data for creating Figure 8 and Figure 9).
- Spreadsheet shows the scalability on the Starlight dataset (data for creating Figure 10).
- Spreadsheet shows the accuracy comparison to the Euclidean one-nearest-neighbor (data for creating Figure 11).
- Spreadsheet shows the trivial effect of parameters r and k (data for creating Figure 12).

PAMAP Dataset

Physical
Activity Monitoring for Aging People (PAMAP) dataset is a time series
dataset recorded when subjects perform different activities such as
walking, running, jumping, etc. The dataset can be found at www.pamap.org. All outdoor activities from the original PAMAP dataset, shown below on the right, is used in Section 5 in the paper.

*Figure from www.pamap.org. |

- The general information of PAMAP dataset can be found here (or from the original wegpage).
- Description of the data collection protocal is provided here (or from the original wegpage).
- An example of our shapelet tree created from our algorithm is here.