License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2022.9
URN: urn:nbn:de:0030-drops-170439
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2022/17043/
Pibiri, Giulio Ermanno
On Weighted k-mer Dictionaries
Abstract
We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing.
In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient, greedy, algorithm to reduce the number of runs and show empirically that it performs well, i.e., very similarly to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
BibTeX - Entry
@InProceedings{pibiri:LIPIcs.WABI.2022.9,
author = {Pibiri, Giulio Ermanno},
title = {{On Weighted k-mer Dictionaries}},
booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
pages = {9:1--9:20},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-243-3},
ISSN = {1868-8969},
year = {2022},
volume = {242},
editor = {Boucher, Christina and Rahmann, Sven},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2022/17043},
URN = {urn:nbn:de:0030-drops-170439},
doi = {10.4230/LIPIcs.WABI.2022.9},
annote = {Keywords: K-Mers, Weights, Compression, Hashing}
}
Keywords: |
|
K-Mers, Weights, Compression, Hashing |
Collection: |
|
22nd International Workshop on Algorithms in Bioinformatics (WABI 2022) |
Issue Date: |
|
2022 |
Date of publication: |
|
26.08.2022 |
Supplementary Material: |
|
Software (Source Code): https://github.com/jermp/sshash |