TY - GEN
T1 - Twister tries
T2 - ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
AU - Cochez, Michael
AU - Mou, Hao
PY - 2015/5/27
Y1 - 2015/5/27
N2 - Many commonly used data-mining techniques utilized across research fields perform poorly when used for large data sets. Sequential agglomerative hierarchical non-overlapping clustering is one technique for which the algorithms' scaling properties prohibit clustering of a large amount of items. Besides the unfavorable time complexity of O(n2), these algorithms have a space complexity of O(n2), which can be reduced to O(n) if the time complexity is allowed to rise to O(n2 log2 n). In this paper, we propose the use of locality-sensitive hashing combined with a novel data structure called twister tries to provide an approximate clustering for average linkage. Our approach requires only linear space. Furthermore, its time complexity is linear in the number of items to be clustered, making it feasible to apply it on a larger scale. We evaluate the approach both analytically and by applying it to several data sets.
AB - Many commonly used data-mining techniques utilized across research fields perform poorly when used for large data sets. Sequential agglomerative hierarchical non-overlapping clustering is one technique for which the algorithms' scaling properties prohibit clustering of a large amount of items. Besides the unfavorable time complexity of O(n2), these algorithms have a space complexity of O(n2), which can be reduced to O(n) if the time complexity is allowed to rise to O(n2 log2 n). In this paper, we propose the use of locality-sensitive hashing combined with a novel data structure called twister tries to provide an approximate clustering for average linkage. Our approach requires only linear space. Furthermore, its time complexity is linear in the number of items to be clustered, making it feasible to apply it on a larger scale. We evaluate the approach both analytically and by applying it to several data sets.
KW - Average linkage
KW - Hierarchical clustering
KW - Linear complexity
KW - Locality-sensitive hashing
UR - http://www.scopus.com/inward/record.url?scp=84955245016&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84955245016&partnerID=8YFLogxK
U2 - 10.1145/2723372.2751521
DO - 10.1145/2723372.2751521
M3 - Conference contribution
AN - SCOPUS:84955245016
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 505
EP - 517
BT - SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 31 May 2015 through 4 June 2015
ER -