TY - JOUR
T1 - Mining maximal frequent patterns in transactional databases and dynamic data streams
T2 - A spark-based approach
AU - Karim, Md Rezaul
AU - Cochez, Michael
AU - Beyan, Oya Deniz
AU - Ahmed, Chowdhury Farhan
AU - Decker, Stefan
PY - 2018/3/1
Y1 - 2018/3/1
N2 - Mining maximal frequent patterns (MFPs) in transactional databases (TDBs) and dynamic data streams (DDSs) is substantially important for business intelligence. MFPs, as the smallest set of patterns, help to reveal customers’ purchase rules and support market basket analysis (MBA). Although numerous studies have been carried out in this area, most of them extend the main-memory based Apriori or FP-growth algorithms. Therefore, these approaches are not only unscalable but also lack parallelism. Consequently, the requirements of ever-increasing big data sources cannot be met. In addition, the mining performance of some existing approaches degrades drastically due to the presence of null transactions. We therefore proposed an efficient way to mine MFPs with Apache Spark to overcome these issues. For faster computation and efficient utilization of memory, we utilized a prime number based data transformation technique in which the values of individual transactions are preserved. After removing null transactions and infrequent items, the resulting transformed dataset becomes denser than the original distributions. We tested our proposed algorithms on both real static TDBs and DDSs. Experimental results and performance analysis show that our approach is efficient and scalable to large dataset sizes.
AB - Mining maximal frequent patterns (MFPs) in transactional databases (TDBs) and dynamic data streams (DDSs) is substantially important for business intelligence. MFPs, as the smallest set of patterns, help to reveal customers’ purchase rules and support market basket analysis (MBA). Although numerous studies have been carried out in this area, most of them extend the main-memory based Apriori or FP-growth algorithms. Therefore, these approaches are not only unscalable but also lack parallelism. Consequently, the requirements of ever-increasing big data sources cannot be met. In addition, the mining performance of some existing approaches degrades drastically due to the presence of null transactions. We therefore proposed an efficient way to mine MFPs with Apache Spark to overcome these issues. For faster computation and efficient utilization of memory, we utilized a prime number based data transformation technique in which the values of individual transactions are preserved. After removing null transactions and infrequent items, the resulting transformed dataset becomes denser than the original distributions. We tested our proposed algorithms on both real static TDBs and DDSs. Experimental results and performance analysis show that our approach is efficient and scalable to large dataset sizes.
KW - Apache Spark
KW - Big data
KW - Data mining
KW - Dynamic data streams
KW - Maximal frequent patterns
KW - Null transactions
KW - Prime number theory
KW - Transactional databases
UR - http://www.scopus.com/inward/record.url?scp=85038209276&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85038209276&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2017.11.064
DO - 10.1016/j.ins.2017.11.064
M3 - Article
AN - SCOPUS:85038209276
VL - 432
SP - 278
EP - 300
JO - Information Sciences
JF - Information Sciences
SN - 0020-0255
ER -