License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2022.16
URN: urn:nbn:de:0030-drops-161439
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2022/16143/
Bernardini, Giulia ;
Conte, Alessio ;
Gabory, Esteban ;
Grossi, Roberto ;
Loukides, Grigorios ;
Pissis, Solon P. ;
Punzi, Giulia ;
Sweering, Michelle
On Strings Having the Same Length- k Substrings
Abstract
Let Substr_k(X) denote the set of length-k substrings of a given string X for a given integer k > 0. We study the following basic string problem, called z-Shortest ?_k-Equivalent Strings: Given a set ?_k of n length-k strings and an integer z > 0, list z shortest distinct strings T₁,…,T_z such that Substr_k(T_i) = ?_k, for all i ∈ [1,z]. The z-Shortest ?_k-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The 1-Shortest ?_k-Equivalent Strings, referred to as Shortest ?_k-Equivalent String, asks for a shortest string X such that Substr_k(X) = ?_k.
Our main contributions are summarized below:
- Given a directed graph G(V,E), the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of G at least once. DCP can be solved in ?̃(|E||V|) time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest ?_k-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP.
- We show that the length of a shortest string output by Shortest ?_k-Equivalent String is in ?(k+n²). We generalize this bound by showing that the total length of z shortest strings is in ?(zk+zn²+z²n). We derive these upper bounds by showing (asymptotically tight) bounds on the total length of z shortest Eulerian walks in general directed graphs.
- We present an algorithm for solving z-Shortest ?_k-Equivalent Strings in ?(nk+n²log²n+zn²log n+|output|) time. If z = 1, the time becomes ?(nk+n²log²n) by the fact that the size of the input is Θ(nk) and the size of the output is ?(k+n²).
BibTeX - Entry
@InProceedings{bernardini_et_al:LIPIcs.CPM.2022.16,
author = {Bernardini, Giulia and Conte, Alessio and Gabory, Esteban and Grossi, Roberto and Loukides, Grigorios and Pissis, Solon P. and Punzi, Giulia and Sweering, Michelle},
title = {{On Strings Having the Same Length- k Substrings}},
booktitle = {33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
pages = {16:1--16:17},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-234-1},
ISSN = {1868-8969},
year = {2022},
volume = {223},
editor = {Bannai, Hideo and Holub, Jan},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2022/16143},
URN = {urn:nbn:de:0030-drops-161439},
doi = {10.4230/LIPIcs.CPM.2022.16},
annote = {Keywords: string algorithms, combinatorics on words, de Bruijn graph, Chinese Postman}
}
Keywords: |
|
string algorithms, combinatorics on words, de Bruijn graph, Chinese Postman |
Collection: |
|
33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022) |
Issue Date: |
|
2022 |
Date of publication: |
|
22.06.2022 |