Abstract
An elasticdegenerate (ED) string T is a sequence of n sets T[1],…,T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as ℒ(T) = {S_1 ⋯ S_n : S_i ∈ T[i] for all i ∈ [1,n]}. ED strings have been introduced to represent a set of closelyrelated DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T₁ and T₂ of lengths n₁ and n₂, cardinalities m₁ and m₂, and sizes N₁ and N₂, respectively, we show the following:
 There is no ?((N₁N₂)^{1ε})time algorithm, thus no ?((N₁m₂+N₂m₁)^{1ε})time algorithm and no ?((N₁n₂+N₂n₁)^{1ε})time algorithm, for any constant ε > 0, for EDSI even when T₁ and T₂ are over a binary alphabet, unless the Strong ExponentialTime Hypothesis is false.
 There is no combinatorial ?((N₁+N₂)^{1.2ε}f(n₁,n₂))time algorithm, for any constant ε > 0 and any function f, for EDSI even when T₁ and T₂ are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false.
 An ?(N₁log N₁log n₁+N₂log N₂log n₂)time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T₁ and T₂ are given in a compact representation, we show that the problem is NPcomplete.
 An ?(N₁m₂+N₂m₁)time algorithm for EDSI.
 An Õ(N₁^{ω1}n₂+N₂^{ω1}n₁)time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size.
We also show that the techniques we develop have applications outside of ED string comparison.
BibTeX  Entry
@InProceedings{gabory_et_al:LIPIcs.CPM.2023.11,
author = {Gabory, Esteban and Mwaniki, Moses Njagi and Pisanti, Nadia and Pissis, Solon P. and Radoszewski, Jakub and Sweering, Michelle and Zuba, Wiktor},
title = {{Comparing ElasticDegenerate Strings: Algorithms, Lower Bounds, and Applications}},
booktitle = {34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
pages = {11:111:20},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959772761},
ISSN = {18688969},
year = {2023},
volume = {259},
editor = {Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
publisher = {Schloss Dagstuhl  LeibnizZentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2023/17965},
URN = {urn:nbn:de:0030drops179650},
doi = {10.4230/LIPIcs.CPM.2023.11},
annote = {Keywords: elasticdegenerate string, sequence comparison, languages intersection, pangenome, acronym identification}
}
Keywords: 

elasticdegenerate string, sequence comparison, languages intersection, pangenome, acronym identification 
Collection: 

34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023) 
Issue Date: 

2023 
Date of publication: 

21.06.2023 