License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2020.16
URN: urn:nbn:de:0030-drops-128057
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2020/12805/
Rahman, Amatur ;
Chikhi, Rayan ;
Medvedev, Paul
Disk Compression of k-mer Sets
Abstract
K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
BibTeX - Entry
@InProceedings{rahman_et_al:LIPIcs:2020:12805,
author = {Amatur Rahman and Rayan Chikhi and Paul Medvedev},
title = {{Disk Compression of k-mer Sets}},
booktitle = {20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
pages = {16:1--16:18},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-161-0},
ISSN = {1868-8969},
year = {2020},
volume = {172},
editor = {Carl Kingsford and Nadia Pisanti},
publisher = {Schloss Dagstuhl--Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2020/12805},
URN = {urn:nbn:de:0030-drops-128057},
doi = {10.4230/LIPIcs.WABI.2020.16},
annote = {Keywords: de Bruijn graphs, compression, k-mer sets, spectrum-preserving string sets}
}
Keywords: |
|
de Bruijn graphs, compression, k-mer sets, spectrum-preserving string sets |
Collection: |
|
20th International Workshop on Algorithms in Bioinformatics (WABI 2020) |
Issue Date: |
|
2020 |
Date of publication: |
|
25.08.2020 |
Supplementary Material: |
|
Software available at http://github.com/medvedevgroup/ESSCompress. |