License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2021.19
URN: urn:nbn:de:0030-drops-139709
Go to the corresponding LIPIcs Volume Portal

Mieno, Takuya ; Pissis, Solon P. ; Stougie, Leen ; Sweering, Michelle

String Sanitization Under Edit Distance: Improved and Generalized

LIPIcs-CPM-2021-19.pdf (1.0 MB)


Let W be a string of length n over an alphabet Σ, k be a positive integer, and ? be a set of length-k substrings of W. The ETFS problem (Edit distance, Total order, Frequency, Sanitization) asks us to construct a string X_ED such that: (i) no string of ? occurs in X_ED; (ii) the order of all other length-k substrings over Σ (and thus the frequency) is the same in W and in X_ED; and (iii) X_ED has minimal edit distance to W. When W represents an individual’s data and ? represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019].
ETFS can be solved in ?(n²k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in ?(n^{2-δ}) time, for any δ > 0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows:
- An ?(n²log²k)-time algorithm to solve ETFS.
- An ?(n²log²n)-time algorithm to solve AETFS (Arbitrary lengths, Edit distance, Total order, Frequency, Sanitization), a generalization of ETFS in which the elements of ? can have arbitrary lengths. Our algorithms are thus optimal up to subpolynomial factors, unless SETH fails.
In order to arrive at these results, we develop new techniques for computing a variant of the standard dynamic programming (DP) table for edit distance. In particular, we simulate the DP table computation using a directed acyclic graph in which every node is assigned to a smaller DP table. We then focus on redundancy in these DP tables and exploit a tabulation technique according to dyadic intervals to obtain an optimal alignment in ?̃(n²) total time. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.

BibTeX - Entry

  author =	{Mieno, Takuya and Pissis, Solon P. and Stougie, Leen and Sweering, Michelle},
  title =	{{String Sanitization Under Edit Distance: Improved and Generalized}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{19:1--19:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-139709},
  doi =		{10.4230/LIPIcs.CPM.2021.19},
  annote =	{Keywords: string algorithms, data sanitization, edit distance, dynamic programming}

Keywords: string algorithms, data sanitization, edit distance, dynamic programming
Collection: 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)
Issue Date: 2021
Date of publication: 30.06.2021

DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI