TY - JOUR
T1 - GoSCAN: Decentralized scalable data clustering
AU - Mashayekhi, H.
AU - Habibi, J.
AU - Voulgaris, S.
AU - van Steen, M.R.
PY - 2013
Y1 - 2013
N2 - Identifying clusters is an important aspect of analyzing large datasets. Clustering algorithms classically require access to the complete dataset. However, as huge amounts of data are increasingly originating from multiple, dispersed sources in distributed systems, alternative solutions are required. Furthermore, data and network dynamicity in a distributed setting demand adaptable clustering solutions that offer accurate clustering models at a reasonable pace. In this paper, we propose GoScan, a fully decentralized density-based clustering algorithm which is capable of clustering dynamic and distributed datasets without requiring central control or message flooding. We identify two major tasks: finding the core data points, and forming the actual clusters, which we execute in parallel employing gossip-based communication. This approach is very efficient, as it offers each peer enough authority to discover the clusters it is interested in. Our algorithm poses no extra burden of overlay formation in the network, while providing high levels of scalability. We also offer several optimizations to the basic clustering algorithm for improving communication overhead and processing costs. Coping with dynamic data is made possible by introducing an age factor, which gradually detects data-set changes and enables clustering updates. In our experimental evaluation, we will show that GoSCAN can discover the clusters efficiently with scalable transmission cost. © 2012 Springer-Verlag Wien.
AB - Identifying clusters is an important aspect of analyzing large datasets. Clustering algorithms classically require access to the complete dataset. However, as huge amounts of data are increasingly originating from multiple, dispersed sources in distributed systems, alternative solutions are required. Furthermore, data and network dynamicity in a distributed setting demand adaptable clustering solutions that offer accurate clustering models at a reasonable pace. In this paper, we propose GoScan, a fully decentralized density-based clustering algorithm which is capable of clustering dynamic and distributed datasets without requiring central control or message flooding. We identify two major tasks: finding the core data points, and forming the actual clusters, which we execute in parallel employing gossip-based communication. This approach is very efficient, as it offers each peer enough authority to discover the clusters it is interested in. Our algorithm poses no extra burden of overlay formation in the network, while providing high levels of scalability. We also offer several optimizations to the basic clustering algorithm for improving communication overhead and processing costs. Coping with dynamic data is made possible by introducing an age factor, which gradually detects data-set changes and enables clustering updates. In our experimental evaluation, we will show that GoSCAN can discover the clusters efficiently with scalable transmission cost. © 2012 Springer-Verlag Wien.
U2 - 10.1007/s00607-012-0264-2
DO - 10.1007/s00607-012-0264-2
M3 - Article
SN - 0010-485X
VL - 95
SP - 759
EP - 784
JO - Computing
JF - Computing
IS - 9
ER -