Abstract
We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence (LCS) problem), or minimizes the number of unaligned symbols (the alignment distance, also known as the complement of LCS). Multiple sequence alignment is a well-studied problem in bioinformatics and is used routinely to identify regions of similarity among DNA, RNA, or protein sequences to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of LCS or alignment distance of m sequences each of length n requires Θ(n^m) time unless the Strong Exponential Time Hypothesis is false. However, unlike the case of two strings, fast algorithms to approximate LCS and alignment distance of multiple sequences are lacking in the literature. A major challenge in this area is to break the triangle inequality barrier. Specifically, by splitting the m sequences into two (roughly) equal-sized groups, computing the alignment distance in each group, and finally combining them using the triangle inequality, it is possible to achieve a 2-approximation in Õ_m(n^{⌈m/2⌉}) time. However, an approximation factor below 2, which would require breaking the triangle inequality barrier, is not known in O(n^{α m}) time for any α < 1. We make significant progress in this direction.
First, we consider a semi-random model where we show that if just one out of the m sequences is (p,B)-pseudorandom, then we can get a below-two approximation in Õ_m(nB^{m-1}+n^{⌊m/2⌋+3}) time. Such semi-random models are very well-studied for the two-string scenario; however, directly extending those works requires all but one of the sequences to be pseudorandom, and would only give an O(1/p) approximation. We overcome these limitations with significant new ideas. Specifically, an ingredient of this proof is a new algorithm that achieves a below-2 approximation in Õ_m(n^{⌊m/2⌋+2}) time when the alignment distance is large. This could be of independent interest.
Next, for LCS of m sequences each of length n, we show that if the optimum LCS is λn for some λ ∈ [0,1], then in Õ_m(n^{⌊m/2⌋+1}) time we can return a common subsequence of length at least λ²n/(2+ε) for any constant ε > 0. In contrast, for two strings, the best known subquadratic algorithm may return a common subsequence of length Θ(λ⁴n).
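To make the objects in the abstract concrete, here is a minimal Python sketch of the exact baseline: LCS length of several sequences by memoized recursion over tuples of prefix lengths, which visits up to on the order of n^m states. It is not the paper's approximation algorithm. The names `lcs_length` and `unaligned_counts` are illustrative, and reporting unaligned symbols per sequence is an assumption, since the abstract does not state how the alignment distance is normalized.

```python
from functools import lru_cache


def lcs_length(seqs):
    """Exact LCS length of several sequences (exponential in their number)."""
    seqs = tuple(seqs)
    if not seqs:
        return 0

    @lru_cache(maxsize=None)
    def solve(pos):
        # pos[k] is the length of the prefix of seqs[k] under consideration.
        if any(p == 0 for p in pos):
            return 0
        last_symbols = {s[p - 1] for s, p in zip(seqs, pos)}
        if len(last_symbols) == 1:
            # Every prefix ends in the same symbol, so align it.
            return 1 + solve(tuple(p - 1 for p in pos))
        # Otherwise some prefix's last symbol is unused: try dropping each one.
        return max(
            solve(pos[:k] + (pos[k] - 1,) + pos[k + 1:])
            for k in range(len(pos))
        )

    return solve(tuple(len(s) for s in seqs))


def unaligned_counts(seqs):
    """Symbols left unaligned in each sequence by an optimal common subsequence.

    Assumption: reported per sequence; the paper may normalize differently.
    """
    seqs = tuple(seqs)
    best = lcs_length(seqs)
    return [len(s) - best for s in seqs]
```

For example, `lcs_length(["ACGT", "AGT", "ACT"])` is 2 (the common subsequence "AT"), leaving 2, 1, and 1 symbols unaligned in the three sequences respectively. The recursion is only practical for short inputs, which is the cost the approximation results above aim to avoid.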
BibTeX Entry
@InProceedings{das_et_al:LIPIcs.APPROX/RANDOM.2022.54,
author = {Das, Debarati and Saha, Barna},
title = {{Approximating LCS and Alignment Distance over Multiple Sequences}},
booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022)},
pages = {54:1--54:21},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-249-5},
ISSN = {1868-8969},
year = {2022},
volume = {245},
editor = {Chakrabarti, Amit and Swamy, Chaitanya},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2022/17176},
URN = {urn:nbn:de:0030-drops-171762},
doi = {10.4230/LIPIcs.APPROX/RANDOM.2022.54},
annote = {Keywords: String Algorithms, Approximation Algorithms}
}
Keywords: String Algorithms, Approximation Algorithms
Collection: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022)
Issue Date: 2022
Date of publication: 15.09.2022