Abstract
Data sanitization and frequent pattern mining are two well-studied topics in data mining. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well.
Original language | English |
---|---|
Pages (from-to) | 5948-5963 |
Journal | IEEE Transactions on Knowledge and Data Engineering |
Volume | 35 |
Issue number | 6 |
DOIs | |
Publication status | Published - 2023 |
Bibliographical note
Publisher Copyright:IEEE
Funding
The work of Giulia Bernardini and Leen Stougie was supported by the Netherlands Organisation for Scientific Research (NWO) under Project OCENW. GROOT.2019.015 Optimization for and with Machine Learning (OPTIMAL). The work of Alessio Conte, Roberto Grossi, andNadia Pisanti was supported in part by the University of Pisa under the PRA - Progetti di Ricerca di Ateneo (Institutional ResearchGrants) - Project No. PRA 20202021 26 Metodi Informatici Integrati per la Biomedica. The work of Garance Gourdel was supported by the French National ResearchAgency (ANR) under GrantANR-20-CE48-000. Thework ofGrigorios Loukides was supported by the Leverhulme Trust under Project RPG-2019-399. The work of Nadia Pisanti, Solon P. Pissis, and Leen Stougiewas supported in part by the European Union’s Horizon 2020 Research and Innovation programme through ALPACA Project under the Marie Skłodowska-Curie Grant Agreement 956229. The work of Solon P. Pissis and Leen Stougiewas supported in part by the EuropeanUnion’s Horizon 2020 research and innovation programme through PANGAIA Project under the Marie Sklodowska-Curie Grant Agreement 872539. The work of Michelle Sweering and Leen Stougie was supported by the Netherlands Organisation for Scientific Research (NWO)throughGravitation-Grant NETWORKS-024.002.003
Funders | Funder number |
---|---|
EuropeanUnion’s Horizon 2020 research and innovation programme | 872539, NETWORKS-024.002.003 |
French National ResearchAgency | |
Horizon 2020 Framework Programme | |
H2020 Marie Skłodowska-Curie Actions | 956229 |
Leverhulme Trust | RPG-2019-399 |
Agence Nationale de la Recherche | |
Nederlandse Organisatie voor Wetenschappelijk Onderzoek | |
Università di Pisa |
Keywords
- Bioinformatics
- Data integrity
- Data mining
- Data privacy
- Data sanitization
- DNA
- Frequent pattern mining
- Genomics
- Knowledge hiding
- Privacy
- Resists
- String algorithms