DROPS - Document

License:

Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.SEA.2023.7
URN: urn:nbn:de:0030-drops-183579
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2023/18357/


Díaz-Domínguez, Diego ; Dönges, Saska ; Puglisi, Simon J. ; Salmela, Leena

Simple Runs-Bounded FM-Index Designs Are Fast

Abstract

Given a string X of length n over an alphabet of size σ, the FM-index data structure allows counting all occurrences of a pattern P of length m in O(m) time via an algorithm called backward search. An important difficulty when searching with an FM-index is to support queries on L, the Burrows-Wheeler transform of X, while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores counts of each symbol that occurs before the block. This method supports the operation rank_c(L, i) (i.e., count the number of times c occurs in the prefix L[1..i]) by first determining the block ⌈i/b⌉ in which i falls and then scanning the block up to the appropriate position, counting occurrences of c along the way. This partial answer to rank_c(L, i) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice.
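
The two designs described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (their source code is linked under Supplementary Material); it only mirrors the structure the abstract describes. All names here (run_length_encode, scan_runs, SymbolBlockRank, RunBlockRank, bwt, count) are hypothetical, the BWT is built naively from sorted rotations with an assumed "\0" sentinel, and plain dicts and lists stand in for the compact encodings a real index would use. rank(c, i) returns the number of occurrences of c in the first i symbols of L, matching rank_c(L, i) over the prefix L[1..i].

```python
from bisect import bisect_right
from itertools import groupby


def run_length_encode(s):
    """Collapse s into (symbol, run_length) pairs."""
    return [(c, len(list(g))) for c, g in groupby(s)]


def scan_runs(runs, c, remaining):
    """Count occurrences of c among the first `remaining` symbols of the runs."""
    total = 0
    for sym, length in runs:
        take = min(length, remaining)
        if sym == c:
            total += take
        remaining -= take
        if remaining == 0:
            break
    return total


class SymbolBlockRank:
    """First design (sketch): blocks of b symbols, each run-length encoded separately."""

    def __init__(self, L, b):
        self.n, self.b = len(L), b
        self.blocks = []  # (symbol counts before the block, runs inside the block)
        counts = {}
        for start in range(0, len(L), b):
            chunk = L[start:start + b]
            self.blocks.append((dict(counts), run_length_encode(chunk)))
            for c in chunk:
                counts[c] = counts.get(c, 0) + 1

    def rank(self, c, i):
        i = min(i, self.n)
        if i <= 0:
            return 0
        k = (i - 1) // self.b  # 0-based index of the block containing position i
        before, runs = self.blocks[k]
        return before.get(c, 0) + scan_runs(runs, c, i - k * self.b)


class RunBlockRank:
    """Second design (sketch): blocks holding an equal number of runs of L."""

    def __init__(self, L, runs_per_block):
        self.n = len(L)
        self.starts = []  # 0-based position in L where each block begins
        self.blocks = []  # (symbol counts before the block, runs inside the block)
        runs, counts, pos = run_length_encode(L), {}, 0
        for k in range(0, len(runs), runs_per_block):
            chunk = runs[k:k + runs_per_block]
            self.starts.append(pos)
            self.blocks.append((dict(counts), chunk))
            for sym, length in chunk:
                counts[sym] = counts.get(sym, 0) + length
                pos += length

    def rank(self, c, i):
        i = min(i, self.n)
        if i <= 0:
            return 0
        # Predecessor query: last block whose start is at or before position i.
        j = bisect_right(self.starts, i - 1) - 1
        before, runs = self.blocks[j]
        return before.get(c, 0) + scan_runs(runs, c, i - self.starts[j])


def bwt(text):
    """Naive BWT via sorted rotations; the "\\0" sentinel is an assumption."""
    text += "\0"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)


def count(index, L, pattern):
    """Backward search: number of occurrences of pattern in the indexed text."""
    C, total = {}, 0  # C[c] = number of symbols in L that are smaller than c
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    lo, hi = 0, len(L)  # half-open range of matching suffixes
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo, hi = C[c] + index.rank(c, lo), C[c] + index.rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, with L = bwt("abracadabra"), both count(SymbolBlockRank(L, 4), L, "abra") and count(RunBlockRank(L, 3), L, "abra") return 2. The block size and runs-per-block values are arbitrary illustrative choices, not parameters taken from the paper.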

BibTeX - Entry

@InProceedings{diazdominguez_et_al:LIPIcs.SEA.2023.7,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego and D\"{o}nges, Saska and Puglisi, Simon J. and Salmela, Leena},
  title =	{{Simple Runs-Bounded FM-Index Designs Are Fast}},
  booktitle =	{21st International Symposium on Experimental Algorithms (SEA 2023)},
  pages =	{7:1--7:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-279-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{265},
  editor =	{Georgiadis, Loukas},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2023/18357},
  URN =		{urn:nbn:de:0030-drops-183579},
  doi =		{10.4230/LIPIcs.SEA.2023.7},
  annote =	{Keywords: data structures, efficient algorithms}
}

Keywords: data structures, efficient algorithms

Collection: 21st International Symposium on Experimental Algorithms (SEA 2023)

Issue Date: 2023

Date of publication: 19.07.2023

Supplementary Material: Software (Source Code): https://github.com/saskeli/block_RLBWT archived at: https://archive.softwareheritage.org/swh:1:dir:244fd876cd15bd291d91e444c8c04bb289aed05f

