License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.SEA.2023.7
URN: urn:nbn:de:0030-drops-183579
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2023/18357/
Díaz-Domínguez, Diego ;
Dönges, Saska ;
Puglisi, Simon J. ;
Salmela, Leena
Simple Runs-Bounded FM-Index Designs Are Fast
Abstract
Given a string X of length n on alphabet σ, the FM-index data structure allows counting all occurrences of a pattern P of length m in O(m) time via an algorithm called backward search. An important difficulty when searching with an FM-index is to support queries on L, the Burrows-Wheeler transform of X, while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly-repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores counts of each symbol that occurs before the block. This method supports the operation rank_c(L, i) (i.e., count the number of times c occurs in the prefix L[1..i]) by first determining the block i/b in which i falls and scanning the block to the appropriate position counting occurrences of c along the way. This partial answer to rank_c(L, i) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice.
BibTeX - Entry
@InProceedings{diazdominguez_et_al:LIPIcs.SEA.2023.7,
author = {D{\'\i}az-Dom{\'\i}nguez, Diego and D\"{o}nges, Saska and Puglisi, Simon J. and Salmela, Leena},
title = {{Simple Runs-Bounded FM-Index Designs Are Fast}},
booktitle = {21st International Symposium on Experimental Algorithms (SEA 2023)},
pages = {7:1--7:16},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-279-2},
ISSN = {1868-8969},
year = {2023},
volume = {265},
editor = {Georgiadis, Loukas},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2023/18357},
URN = {urn:nbn:de:0030-drops-183579},
doi = {10.4230/LIPIcs.SEA.2023.7},
annote = {Keywords: data structures, efficient algorithms}
}