License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2022.25
URN: urn:nbn:de:0030-drops-161529
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2022/16152/
Go to the corresponding LIPIcs Volume Portal


Cenzato, Davide ; Lipták, Zsuzsanna

A Theoretical and Experimental Analysis of BWT Variants for String Collections

pdf-format:
LIPIcs-CPM-2022-25.pdf (2 MB)


Abstract

The extended Burrows-Wheeler-Transform (eBWT), introduced by Mantaci et al. [Theor. Comput. Sci., 2007], is a generalization of the Burrows-Wheeler-Transform (BWT) to multisets of strings. While the original BWT is based on the lexicographic order, the eBWT uses the omega-order, which differs from the lexicographic order in important ways. A number of tools are available that compute the BWT of string collections; however, the data structures they generate in most cases differ from the one originally defined, as well as from each other. In this paper, we review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on several real-life datasets with different characteristics. We find that the differences can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences. The widely-used parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2.

BibTeX - Entry

@InProceedings{cenzato_et_al:LIPIcs.CPM.2022.25,
  author =	{Cenzato, Davide and Lipt\'{a}k, Zsuzsanna},
  title =	{{A Theoretical and Experimental Analysis of BWT Variants for String Collections}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{25:1--25:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2022/16152},
  URN =		{urn:nbn:de:0030-drops-161529},
  doi =		{10.4230/LIPIcs.CPM.2022.25},
  annote =	{Keywords: Burrows-Wheeler-Transform, extended BWT, string collections, repetitiveness measures, r, compression}
}

Keywords: Burrows-Wheeler-Transform, extended BWT, string collections, repetitiveness measures, r, compression
Collection: 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)
Issue Date: 2022
Date of publication: 22.06.2022
Supplementary Material: Software (Source Code and Data): https://github.com/davidecenzato/BWT-variants-for-string-collections


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI