License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2020.16
URN: urn:nbn:de:0030-drops-128057
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2020/12805/
Go to the corresponding LIPIcs Volume Portal


Rahman, Amatur ; Chikhi, Rayan ; Medvedev, Paul

Disk Compression of k-mer Sets

pdf-format:
LIPIcs-WABI-2020-16.pdf (1 MB)


Abstract

K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.

BibTeX - Entry

@InProceedings{rahman_et_al:LIPIcs:2020:12805,
  author =	{Amatur Rahman and Rayan Chikhi and Paul Medvedev},
  title =	{{Disk Compression of k-mer Sets}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{16:1--16:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Carl Kingsford and Nadia Pisanti},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2020/12805},
  URN =		{urn:nbn:de:0030-drops-128057},
  doi =		{10.4230/LIPIcs.WABI.2020.16},
  annote =	{Keywords: de Bruijn graphs, compression, k-mer sets, spectrum-preserving string sets}
}

Keywords: de Bruijn graphs, compression, k-mer sets, spectrum-preserving string sets
Collection: 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)
Issue Date: 2020
Date of publication: 25.08.2020
Supplementary Material: Software available at http://github.com/medvedevgroup/ESSCompress.


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI