DROPS - Document

License:

Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ESA.2023.40
URN: urn:nbn:de:0030-drops-186935
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2023/18693/

Go to the corresponding LIPIcs Volume Portal

Ding, Xiangyun ; Dong, Xiaojun ; Gu, Yan ; Liu, Youzhe ; Sun, Yihan

Efficient Parallel Output-Sensitive Edit Distance

pdf-format:

LIPIcs-ESA-2023-40.pdf (2 MB)

Abstract

In this paper, we study efficient parallel edit distance algorithms, both in theory and in practice. Given two strings A[1..n] and B[1..m], and a set of operations allowed to edit the strings, the edit distance between A and B is the minimum number of operations required to transform A into B. In this paper, we use edit distance to refer to the Levenshtein distance, which allows for unit-cost single-character edits (insertions, deletions, substitutions). Sequentially, a standard Dynamic Programming (DP) algorithm solves edit distance with Θ(nm) cost. In many real-world applications, the strings to be compared are similar to each other and have small edit distances. To achieve highly practical implementations, we focus on output-sensitive parallel edit-distance algorithms, i.e., to achieve asymptotically better cost bounds than the standard Θ(nm) algorithm when the edit distance is small. We study four algorithms in the paper, including three algorithms based on Breadth-First Search (BFS), and one algorithm based on Divide-and-Conquer (DaC). Our BFS-based solution is based on the Landau-Vishkin algorithm. We implement three different data structures for the longest common prefix (LCP) queries needed in the algorithm: the classic solution using parallel suffix array, and two hash-based solutions proposed in this paper. Our DaC-based solution is inspired by the output-insensitive solution proposed by Apostolico et al., and we propose a non-trivial adaption to make it output-sensitive. All of the algorithms studied in this paper have good theoretical guarantees, and they achieve different tradeoffs between work (total number of operations), span (longest dependence chain in the computation), and space.
We test and compare our algorithms on both synthetic data and real-world data, including DNA sequences, Wikipedia texts, GitHub repositories, etc. Our BFS-based algorithms outperform the existing parallel edit-distance implementation in ParlayLib in all test cases. On cases with fewer than 10⁵ edits, our algorithm can process input sequences of size 10⁹ in about ten seconds, while ParlayLib can only process sequences of sizes up to 10⁶ in the same amount of time. By comparing our algorithms, we also provide a better understanding of the choice of algorithms for different input patterns. We believe that our paper is the first systematic study in the theory and practice of parallel edit distance.

BibTeX - Entry

@InProceedings{ding_et_al:LIPIcs.ESA.2023.40,
  author =	{Ding, Xiangyun and Dong, Xiaojun and Gu, Yan and Liu, Youzhe and Sun, Yihan},
  title =	{{Efficient Parallel Output-Sensitive Edit Distance}},
  booktitle =	{31st Annual European Symposium on Algorithms (ESA 2023)},
  pages =	{40:1--40:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-295-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{274},
  editor =	{G{\o}rtz, Inge Li and Farach-Colton, Martin and Puglisi, Simon J. and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2023/18693},
  URN =		{urn:nbn:de:0030-drops-186935},
  doi =		{10.4230/LIPIcs.ESA.2023.40},
  annote =	{Keywords: Edit Distance, Parallel Algorithms, String Algorithms, Dynamic Programming, Pattern Matching}
}

Keywords: Edit Distance, Parallel Algorithms, String Algorithms, Dynamic Programming, Pattern Matching

Collection: 31st Annual European Symposium on Algorithms (ESA 2023)

Issue Date: 2023

Date of publication: 30.08.2023

Supplementary Material: Software (Source Code): https://github.com/ucrparlay/Edit-Distance archived at: https://archive.softwareheritage.org/swh:1:dir:f5b361f279f694fcde074f24f5e11fb41bddf96e

DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI

Keywords:		Edit Distance, Parallel Algorithms, String Algorithms, Dynamic Programming, Pattern Matching
Collection:		31st Annual European Symposium on Algorithms (ESA 2023)
Issue Date:		2023
Date of publication:		30.08.2023
Supplementary Material:		Software (Source Code): https://github.com/ucrparlay/Edit-Distance archived at: https://archive.softwareheritage.org/swh:1:dir:f5b361f279f694fcde074f24f5e11fb41bddf96e