MASS: Mueen's Algorithm for Similarity Search
The Fastest Similarity Search Algorithm for Time
Series
Subsequences under Euclidean Distance and Correlation Coefficient
Similarity search for time series subsequences is THE most important
subroutine for time series pattern mining. Subsequence similarity
search has been scaled to trillions obsetvations under both DTW
(Dynamic Time Warping) and Euclidean distances [a]. The algorithms are
ultra fast and efficient. The key technique that makes the
algorithms useful
is the Early Abandoning technique [b,e] known since 1994.
However, the algorithms lack few properties
that are useful for many time series data mining
algorithms.
1. Early abandoning
depends on the dataset. The worst case complexity
is still O(nm)
where n is
the length of the larger time series and m is the length of
the short query.
2. The algorithm can
produce the most similar subsequence to the query
and cannot
produce the Distance Profile to all the
subssequences given the query.
MASS is an algorithm to create Distance Profile of a query to a
long time series. In this page we share a code for The Fastest
Similarity Search
Algorithm for Time Series Subsequences under Euclidean Distance. Early
abandoning can occasionally beat this algorithm on some datasets for
some queries. This algorithm is independent of data and query. The
underlying concept of the algorithm is known for a long time to the
signal processing community. We have used it for the first time on time
series subsequence search under z-normalization. The
algorithm was used as a subroutine in our papers [c,d] and the code are
given below.
1.
The algorithm has an overall time complexity of O(n log n) which does
not depend on datasets and is the lower bound of similarity
search
over time series subsequences.
2.
The algorithm produces all of the distances from the query
to the
subsequences of a long time series. In our recent paper, we generalize
the usage of the distance profiles calculated using MASS in finding
motifs, shapelets and discords. Check out the
paper here.
To cite this code:
Abdullah Mueen,
Yan Zhu, Michael Yeh, Kaveh Kamgar, Krishnamurthy Viswanathan, Chetan Kumar Gupta and Eamonn Keogh
(2015), The Fastest Similarity Search Algorithm for Time Series
Subsequences under Euclidean Distance,
URL: http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html
or use the following bib snippet.
@misc{FastestSimilaritySearch,
title={The Fastest Similarity Search Algorithm for Time Series
Subsequences under Euclidean Distance},
author={ Mueen, Abdullah and Zhu, Yan and Yeh, Michael and Kamgar, Kaveh and Viswanathan, Krishnamurthy and Gupta,
Chetan and Keogh, Eamonn},
year={2017},
month={August},
note = {\url{http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html}}
}
Code:
The first version of MASS was developed in 2011 and published in this page in 2015. It was named findNN initially and later renamed as MASS.
The second version of MASS_V2 was developed in 2017 which is more than 2X faster than the first version. Several people contributed to this code including Yan Zhu, Michael Yeh, Kaveh Kamgar and Eamonn Keogh from UCR.
The third version of MASS_V3 is a piecewise version of MASS that performs better when the size of the pieces are well aligned with the hardware.
MASS_weighted is an extension to create weighted distance profiles.
MASS_absolute is a simplification of the original MASS algoeithm to search for absolute matches without normalizing the query and matches.
A python version written by Adnan Ibn Khair is available here. Another version written by Tyler Marrs is available here.
A C++ version is available here.
If you have MATLAB Parallel Computing Toolbox and a GPU installed, MASS can be much faster for you. Consider the following few lines.
>> tic, MASS_V3(gpuArray(randn(100000000,1)),gpuArray(rand(400,1)),2^23); toc;
Elapsed time is 1.600559 seconds.
>> tic, MASS_V3(randn(100000000,1),rand(400,1),2^23); toc;
Elapsed time is 8.881304 seconds.
If you find this code useful for your research, please drop me a line, I would love to know the impact of this code.
Sample Case Studies using MASS:
These slides are created by Prof. Eamonn Keogh (many thanks).
Simple Case Studies
Datasets for the case studies.
Penguin
Robot
Carpet
Tutorial:
MASS has been featured in several tutorials presented in top data mining venues. A self-contained version is available here.
[a]
Thanawin Rakthanmanon, Bilson J. L. Campana, Abdullah Mueen, Gustavo E.
A. P. A. Batista, M. Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn
J. Keogh: Searching and mining trillions of time series subsequences
under dynamic time warping. KDD 2012: 262-270
[b] Christos
Faloutsos, M. Ranganathan, Yannis Manolopoulos: Fast Subsequence
Matching in Time-Series Databases. SIGMOD Conference 1994: 419-429
[c] Abdullah Mueen, Hossein Hamooni, Trilce Estrada: Time Series Join
on Subsequence Correlation. ICDM 2014: 450-459
[d] Jesin Zakaria, Abdullah Mueen, Eamonn J. Keogh: Clustering Time
Series Using Unsupervised-Shapelets. ICDM 2012: 785-794
[e]
Eamonn J. Keogh, Shruti Kasetty: On the need for time series data
mining benchmarks: a survey and empirical demonstration. KDD 2002:
102-111
Research
and code development funded by NSF IIS - 1161997, NSF SHF 1527127 and an internship at
HP Labs.