License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2022.25
URN: urn:nbn:de:0030-drops-161529
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2022/16152/
Cenzato, Davide ;
Lipták, Zsuzsanna
A Theoretical and Experimental Analysis of BWT Variants for String Collections
Abstract
The extended Burrows-Wheeler-Transform (eBWT), introduced by Mantaci et al. [Theor. Comput. Sci., 2007], is a generalization of the Burrows-Wheeler-Transform (BWT) to multisets of strings. While the original BWT is based on the lexicographic order, the eBWT uses the omega-order, which differs from the lexicographic order in important ways. A number of tools are available that compute the BWT of string collections; however, the data structures they generate in most cases differ from the one originally defined, as well as from each other. In this paper, we review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on several real-life datasets with different characteristics. We find that the differences can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences. The widely-used parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2.
BibTeX - Entry
@InProceedings{cenzato_et_al:LIPIcs.CPM.2022.25,
author = {Cenzato, Davide and Lipt\'{a}k, Zsuzsanna},
title = {{A Theoretical and Experimental Analysis of BWT Variants for String Collections}},
booktitle = {33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
pages = {25:1--25:18},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-234-1},
ISSN = {1868-8969},
year = {2022},
volume = {223},
editor = {Bannai, Hideo and Holub, Jan},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2022/16152},
URN = {urn:nbn:de:0030-drops-161529},
doi = {10.4230/LIPIcs.CPM.2022.25},
annote = {Keywords: Burrows-Wheeler-Transform, extended BWT, string collections, repetitiveness measures, r, compression}
}
Keywords: |
|
Burrows-Wheeler-Transform, extended BWT, string collections, repetitiveness measures, r, compression |
Collection: |
|
33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022) |
Issue Date: |
|
2022 |
Date of publication: |
|
22.06.2022 |
Supplementary Material: |
|
Software (Source Code and Data): https://github.com/davidecenzato/BWT-variants-for-string-collections |