License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2021.13
URN: urn:nbn:de:0030-drops-139647
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2021/13964/
Go to the corresponding LIPIcs Volume Portal


Cobas, Dustin ; Gagie, Travis ; Navarro, Gonzalo

A Fast and Small Subsampled R-Index

pdf-format:
LIPIcs-CPM-2021-13.pdf (1 MB)


Abstract

The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, ?(r) where r is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the sr-index, a variant that limits a large fraction of the space to ?(min(r,n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by the r-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the sr-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the r-index while using 1.5-3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the sr-index, using about half the space, but they are an order of magnitude slower.

BibTeX - Entry

@InProceedings{cobas_et_al:LIPIcs.CPM.2021.13,
  author =	{Cobas, Dustin and Gagie, Travis and Navarro, Gonzalo},
  title =	{{A Fast and Small Subsampled R-Index}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{13:1--13:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2021/13964},
  URN =		{urn:nbn:de:0030-drops-139647},
  doi =		{10.4230/LIPIcs.CPM.2021.13},
  annote =	{Keywords: Pattern matching, r-index, compressed text indexing, repetitive text collections}
}

Keywords: Pattern matching, r-index, compressed text indexing, repetitive text collections
Collection: 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)
Issue Date: 2021
Date of publication: 30.06.2021
Supplementary Material: Software (Source Code): https://github.com/duscob/sr-index archived at: https://archive.softwareheritage.org/swh:1:dir:edbc683faf60b1c3604e211cb464d6183822bb0a


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI