License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ESA.2023.83
URN: urn:nbn:de:0030-drops-187360
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2023/18736/
Go to the corresponding LIPIcs Volume Portal


Masillo, Francesco

Matching Statistics Speed up BWT Construction

pdf-format:
LIPIcs-ESA-2023-83.pdf (0.8 MB)


Abstract

Due to the exponential growth of genomic data, constructing dedicated data structures has become the principal bottleneck in common bioinformatics applications. In particular, the Burrows-Wheeler Transform (BWT) is the basis of some of the most popular self-indexes for genomic data, due to its known favourable behaviour on repetitive data.
Some tools that exploit the intrinsic repetitiveness of biological data have risen in popularity, due to their speed and low space consumption. We introduce a new algorithm for computing the BWT, which takes advantage of the redundancy of the data through a compressed version of matching statistics, the CMS of [Lipták et al., WABI 2022]. We show that it suffices to sort a small subset of suffixes, lowering both computation time and space. Our result is due to a new insight which links the so-called insert-heads of [Lipták et al., WABI 2022] to the well-known run boundaries of the BWT.
We give two implementations of our algorithm, called CMS-BWT, both competitive in our experimental validation on highly repetitive real-life datasets. In most cases, they outperform other tools w.r.t. running time, trading off a higher memory footprint, which, however, is still considerably smaller than the total size of the input data.

BibTeX - Entry

@InProceedings{masillo:LIPIcs.ESA.2023.83,
  author =	{Masillo, Francesco},
  title =	{{Matching Statistics Speed up BWT Construction}},
  booktitle =	{31st Annual European Symposium on Algorithms (ESA 2023)},
  pages =	{83:1--83:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-295-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{274},
  editor =	{G{\o}rtz, Inge Li and Farach-Colton, Martin and Puglisi, Simon J. and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2023/18736},
  URN =		{urn:nbn:de:0030-drops-187360},
  doi =		{10.4230/LIPIcs.ESA.2023.83},
  annote =	{Keywords: Burrows-Wheeler Transform, matching statistics, string collections, compressed representation, data structures, efficient algorithms}
}

Keywords: Burrows-Wheeler Transform, matching statistics, string collections, compressed representation, data structures, efficient algorithms
Collection: 31st Annual European Symposium on Algorithms (ESA 2023)
Issue Date: 2023
Date of publication: 30.08.2023
Supplementary Material: Software (Source Code): https://github.com/fmasillo/CMS-BWT


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI