License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2023.21
URN: urn:nbn:de:0030-drops-179757
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2023/17975/
Go to the corresponding LIPIcs Volume Portal


Loukides, Grigorios ; Pissis, Solon P. ; Thankachan, Sharma V. ; Zuba, Wiktor

Suffix-Prefix Queries on a Dictionary

pdf-format:
LIPIcs-CPM-2023-21.pdf (1 MB)


Abstract

In the all-pairs suffix-prefix (APSP) problem, we are given a dictionary R of k strings, S_1,…,S_k, of total length n, and we are asked to find the length SPL_{i,j} of the longest string that is both a suffix of S_i and a prefix of S_j, for all i,j ∈ [1,k]. APSP is a classic problem in string algorithms with many applications in bioinformatics. When all strings of the dictionary are over an integer alphabet of size σ ≤ n^?(1), APSP can be solved in the optimal ?(n+k²) time with the use of the generalized suffix tree of the dictionary [Gusfield et al., Inf. Process. Lett. 1992].
In many bioinformatics applications, such as in sequence assembly, the size k of dictionary R is very large. In particular, k² usually dominates n, and thus the k² factor is the bottleneck both in the time and in the space complexity of such applications. We thus initiate a holistic study on several data structure variants of APSP. In particular, we consider the following types of queries:
- One-to-One(i,j): output SPL_{i,j}.
- One-to-All(i): output SPL_{i,j} for every j ∈ [1,k].
- Report(i,?): output all distinct j ∈ [1,k] such that SPL_{i,j} ≥ ?, where ? ≥ 0 is an integer.
- Count(i,?): output the number of distinct j ∈ [1,k] such that SPL_{i,j} ≥ ?, where ? ≥ 0 is an integer.
- Top(i,K): output K distinct j ∈ [1,k] with the highest values of SPL_{i,j} breaking ties arbitrarily.
We assume the standard word RAM model of computation with word size w = Ω(log n) and an integer alphabet of size σ ≤ n^?(1). We show the following upper bounds:

Query | Space (words) | Query time | Note
One-to-One(i,j) | ?(n) | ?(log log k) | Theorem 11
One-to-All(i) | ?(n) | ?(k) | Theorem 14
Report(i,?) | ?(n) | ?(log n/log log n+output) | Theorem 19(i)
Count(i,?) | ?(n) | ?(log n/log log n) | Theorem 19(ii)
Top(i,K) | ?(n) | ?(log² n/log log n+K) | Theorem 22

We also present efficient algorithms for constructing these data structures.

BibTeX - Entry

@InProceedings{loukides_et_al:LIPIcs.CPM.2023.21,
  author =	{Loukides, Grigorios and Pissis, Solon P. and Thankachan, Sharma V. and Zuba, Wiktor},
  title =	{{Suffix-Prefix Queries on a Dictionary}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{21:1--21:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2023/17975},
  URN =		{urn:nbn:de:0030-drops-179757},
  doi =		{10.4230/LIPIcs.CPM.2023.21},
  annote =	{Keywords: all-pairs suffix-prefix, suffix-prefix queries, internal pattern matching}
}

Keywords: all-pairs suffix-prefix, suffix-prefix queries, internal pattern matching
Collection: 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)
Issue Date: 2023
Date of publication: 21.06.2023


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI