## Abstract

Let W be a string of length n over an alphabet S, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem (Edit distance, Total order, Frequency, Sanitization) asks us to construct a string X_{ED}such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ (and thus the frequency) is the same in W and in X_{ED}; and (iii) X_{ED}has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in O(n^{2}k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in O(n^{2-δ}) time, for any δ > 0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: An O(n^{2}log^{2}k)-time algorithm to solve ETFS. An O(n^{2}log^{2}n)-time algorithm to solve AETFS (Arbitrary lengths, Edit distance, Total order, Frequency, Sanitization), a generalization of ETFS in which the elements of S can have arbitrary lengths. Our algorithms are thus optimal up to subpolynomial factors, unless SETH fails. In order to arrive at these results, we develop new techniques for computing a variant of the standard dynamic programming (DP) table for edit distance. In particular, we simulate the DP table computation using a directed acyclic graph in which every node is assigned to a smaller DP table. We then focus on redundancy in these DP tables and exploit a tabulation technique according to dyadic intervals to obtain an optimal alignment in O(n^{2}) total time1. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.

Original language | English |
---|---|

Title of host publication | 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021) |

Editors | Pawel Gawrychowski, Tatiana Starikovskaya |

Publisher | Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing |

Pages | 1-18 |

Number of pages | 18 |

ISBN (Print) | 9783959771863 |

DOIs | |

Publication status | Published - 2021 |

Event | 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021 - Wroclaw, Poland Duration: 5 Jul 2021 → 7 Jul 2021 |

### Publication series

Name | Leibniz International Proceedings in Informatics (LIPIcs) |
---|---|

Volume | 191 |

ISSN (Print) | 1868-8969 |

### Conference

Conference | 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021 |
---|---|

Country/Territory | Poland |

City | Wroclaw |

Period | 5/07/21 → 7/07/21 |

### Bibliographical note

Funding Information:Funding Takuya Mieno: JSPS KAKENHI Grant Number JP20J11983 Leen Stougie: Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003 Michelle Sweering: Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003

Publisher Copyright:

© 2021 Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing. All rights reserved.

Copyright:

Copyright 2021 Elsevier B.V., All rights reserved.

## Keywords

- Data sanitization
- Dynamic programming
- Edit distance
- String algorithms