License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2021.21
URN: urn:nbn:de:0030-drops-139723
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2021/13972/
Go to the corresponding LIPIcs Volume Portal


Nishimoto, Takaaki ; Tabei, Yasuo

R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space

pdf-format:
LIPIcs-CPM-2021-21.pdf (1.0 MB)


Abstract

Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because there are a wide variety of applications in various areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space-usage is proportional to the length of an input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on RLBWT. R-enum runs in O(n log log (n/r)) time and with O(r log n) bits of working space for string length n and number r of runs in RLBWT. Here, r is expected to be significantly smaller than n for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that the results of r-enum are more space-efficient than the previous results. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes.

BibTeX - Entry

@InProceedings{nishimoto_et_al:LIPIcs.CPM.2021.21,
  author =	{Nishimoto, Takaaki and Tabei, Yasuo},
  title =	{{R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{21:1--21:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2021/13972},
  URN =		{urn:nbn:de:0030-drops-139723},
  doi =		{10.4230/LIPIcs.CPM.2021.21},
  annote =	{Keywords: Enumeration algorithm, Burrows-Wheeler transform, Maximal repeats, Minimal unique substrings, Minimal absent words}
}

Keywords: Enumeration algorithm, Burrows-Wheeler transform, Maximal repeats, Minimal unique substrings, Minimal absent words
Collection: 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)
Issue Date: 2021
Date of publication: 30.06.2021
Supplementary Material: Software (Source Code): https://github.com/TNishimoto/renum archived at: https://archive.softwareheritage.org/swh:1:dir:cde748f235b355d80783d13b86eaa048b7c965ea


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI