License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2022.16
URN: urn:nbn:de:0030-drops-161439
Go to the corresponding LIPIcs Volume Portal

Bernardini, Giulia ; Conte, Alessio ; Gabory, Esteban ; Grossi, Roberto ; Loukides, Grigorios ; Pissis, Solon P. ; Punzi, Giulia ; Sweering, Michelle

On Strings Having the Same Length- k Substrings

LIPIcs-CPM-2022-16.pdf (0.9 MB)


Let Substr_k(X) denote the set of length-k substrings of a given string X for a given integer k > 0. We study the following basic string problem, called z-Shortest ?_k-Equivalent Strings: Given a set ?_k of n length-k strings and an integer z > 0, list z shortest distinct strings T₁,…,T_z such that Substr_k(T_i) = ?_k, for all i ∈ [1,z]. The z-Shortest ?_k-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The 1-Shortest ?_k-Equivalent Strings, referred to as Shortest ?_k-Equivalent String, asks for a shortest string X such that Substr_k(X) = ?_k.
Our main contributions are summarized below:
- Given a directed graph G(V,E), the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of G at least once. DCP can be solved in ?̃(|E||V|) time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest ?_k-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP.
- We show that the length of a shortest string output by Shortest ?_k-Equivalent String is in ?(k+n²). We generalize this bound by showing that the total length of z shortest strings is in ?(zk+zn²+z²n). We derive these upper bounds by showing (asymptotically tight) bounds on the total length of z shortest Eulerian walks in general directed graphs.
- We present an algorithm for solving z-Shortest ?_k-Equivalent Strings in ?(nk+n²log²n+zn²log n+|output|) time. If z = 1, the time becomes ?(nk+n²log²n) by the fact that the size of the input is Θ(nk) and the size of the output is ?(k+n²).

BibTeX - Entry

  author =	{Bernardini, Giulia and Conte, Alessio and Gabory, Esteban and Grossi, Roberto and Loukides, Grigorios and Pissis, Solon P. and Punzi, Giulia and Sweering, Michelle},
  title =	{{On Strings Having the Same Length- k Substrings}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{16:1--16:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-161439},
  doi =		{10.4230/LIPIcs.CPM.2022.16},
  annote =	{Keywords: string algorithms, combinatorics on words, de Bruijn graph, Chinese Postman}

Keywords: string algorithms, combinatorics on words, de Bruijn graph, Chinese Postman
Collection: 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)
Issue Date: 2022
Date of publication: 22.06.2022

DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI