License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.SEA.2023.4
URN: urn:nbn:de:0030-drops-183549
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2023/18354/
Alanko, Jarno N. ;
Biagi, Elena ;
Puglisi, Simon J. ;
Vuohtoniemi, Jaakko
Subset Wavelet Trees
Abstract
Given an alphabet Σ of σ = |Σ| symbols, a degenerate (or indeterminate) string X is a sequence X = X[0],X[1]…, X[n-1] of n subsets of Σ. Since their introduction in the mid 70s, degenerate strings have been widely studied, with applications driven by their being a natural model for sequences in which there is a degree of uncertainty about the precise symbol at a given position, such as those arising in genomics and proteomics. In this paper we introduce a new data structural tool for degenerate strings, called the subset wavelet tree (SubsetWT). A SubsetWT supports two basic operations on degenerate strings: subset-rank(i,c), which returns the number of subsets up to the i-th subset in the degenerate string that contain the symbol c; and subset-select(i,c), which returns the index in the degenerate string of the i-th subset that contains symbol c. These queries are analogs of rank and select queries that have been widely studied for ordinary strings. Via experiments in a real genomics application in which degenerate strings are fundamental, we show that subset wavelet trees are practical data structures, and in particular offer an attractive space-time tradeoff. Along the way we investigate data structures for supporting (normal) rank queries on base-4 and base-3 sequences, which may be of independent interest. Our C++ implementations of the data structures are available at https://github.com/jnalanko/SubsetWT.
BibTeX - Entry
@InProceedings{alanko_et_al:LIPIcs.SEA.2023.4,
author = {Alanko, Jarno N. and Biagi, Elena and Puglisi, Simon J. and Vuohtoniemi, Jaakko},
title = {{Subset Wavelet Trees}},
booktitle = {21st International Symposium on Experimental Algorithms (SEA 2023)},
pages = {4:1--4:14},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-279-2},
ISSN = {1868-8969},
year = {2023},
volume = {265},
editor = {Georgiadis, Loukas},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2023/18354},
URN = {urn:nbn:de:0030-drops-183549},
doi = {10.4230/LIPIcs.SEA.2023.4},
annote = {Keywords: degenerate strings, compressed data structures, succinct data structures, string processing, data structures, efficient algorithms}
}
Keywords: |
|
degenerate strings, compressed data structures, succinct data structures, string processing, data structures, efficient algorithms |
Collection: |
|
21st International Symposium on Experimental Algorithms (SEA 2023) |
Issue Date: |
|
2023 |
Date of publication: |
|
19.07.2023 |
Supplementary Material: |
|
Software (Source Code): https://github.com/jnalanko/SubsetWT Software (Source Code): https://github.com/jnalanko/SubsetWT-Experiments |