MASS: Mueen's Algorithm for Similarity Search

The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance

Similarity search for time series subseqiences is THE most important subroutine for time series pattern mining. Subsequence similarity search has been scaled to trillions obsetvations under both DTW (Dynamic Time Warping) and Euclidean distances [a]. The algorithms are super fast and efficient. The key technique that makes the algorithms useful is the Early Abandoning technique [b,e] known for over a decade. However, the algorithms lack few properties that are useful for many time series data mining algorithms.

1. Early abandoning depends on the dataset. The worst case complexity is still O(nm) where n is the length of the larger time series and m is the length of the short query.
2. The algorithm can produce the most similar subsequence to the query and cannot produce any information on the distribution of the distances to all the subssequences.

In this page we share a code for The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance. Early abandoning can occasionally beat this algorithm on some datasets for some queries. This algorithm is independent of data and query. The underlying concept of the algorithm is known for a long time to the signal processing community. We have used it for the first time on time series subsequence search under normalization. The algorithm was used as a subroutine in our papers [c,d] and the code is given below.

1. The algorithm has an overall time complexity of O(n log n) which does not depend on datasets and is the lower bound of similarity search over time series subsequences.
2. The algorithm produces all of the distances from the query to the subsequences of a long time series.


To cite  this code:

Abdullah Mueen, Krishnamurthy Viswanathan, Chetan Kumar Gupta and Eamonn Keogh (2015), The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance, URL: http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html

or use the following bib snippet.

@misc{FastestSimilaritySearch,
title={The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance},
author={ Mueen, Abdullah and Viswanathan, Krishnamurthy and Gupta, Chetan and Keogh, Eamonn},
year={2015},
month={August},
note = {\url{http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html}}
}



If you find this code useful for your research, please drop me a line, I would love to know the impact of this code.

[a] Thanawin Rakthanmanon, Bilson J. L. Campana, Abdullah Mueen, Gustavo E. A. P. A. Batista, M. Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn J. Keogh: Searching and mining trillions of time series subsequences under dynamic time warping. KDD 2012: 262-270
[b] Christos Faloutsos, M. Ranganathan, Yannis Manolopoulos: Fast Subsequence Matching in Time-Series Databases. SIGMOD Conference 1994: 419-429
[c] Abdullah Mueen, Hossein Hamooni, Trilce Estrada: Time Series Join on Subsequence Correlation. ICDM 2014: 450-459
[d] Jesin Zakaria, Abdullah Mueen, Eamonn J. Keogh: Clustering Time Series Using Unsupervised-Shapelets. ICDM 2012: 785-794
[e] Eamonn J. Keogh, Shruti Kasetty: On the need for time series data mining benchmarks: a survey and empirical demonstration. KDD 2002: 102-111



Research and code development funded by NSF IIS - 1161997 and an internship at HP Labs.