License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2022.12
URN: urn:nbn:de:0030-drops-170467
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2022/17046/
Zentgraf, Jens ;
Rahmann, Sven
Fast Gapped k-mer Counting with Subdivided Multi-Way Bucketed Cuckoo Hash Tables
Abstract
Motivation. In biological sequence analysis, alignment-free (also known as k-mer-based) methods are increasingly replacing mapping- and alignment-based methods for various applications. A basic step of such methods consists of building a table of all k-mers of a given set of sequences (a reference genome or a dataset of sequenced reads) and their counts. Over the past years, efficient methods and tools for k-mer counting have been developed. In a different line of work, the use of gapped k-mers has been shown to offer advantages over the use of the standard contiguous k-mers. However, no tool seems to be available that is able to count gapped k-mers with the same efficiency as contiguous k-mers. One reason is that the most efficient k-mer counters use minimizers (of a length m < k) to group k-mers into buckets, such that many consecutive k-mers are classified into the same bucket. This approach leads to cache-friendly (and hence extremely fast) algorithms, but the approach does not transfer easily to gapped k-mers. Consequently, the existing efficient k-mer counters cannot be trivially modified to count gapped k-mers with the same efficiency.
Results. We present a different approach that is equally applicable to contiguous k-mers and gapped k-mers. We use multi-way bucketed Cuckoo hash tables to efficiently store (gapped) k-mers and their counts. We also describe a method to parallelize counting over multiple threads without using locks: We subdivide the hash table into independent subtables, and use a producer-consumer model, such that each thread serves one subtable. This requires designing Cuckoo hash functions with the property that all alternative locations for each k-mer are located in the same subtable. Compared to some of the fastest contiguous k-mer counters, our approach is of comparable speed, or even faster, on large datasets, and it is the only one that supports gapped k-mers.
BibTeX - Entry
@InProceedings{zentgraf_et_al:LIPIcs.WABI.2022.12,
author = {Zentgraf, Jens and Rahmann, Sven},
title = {{Fast Gapped k-mer Counting with Subdivided Multi-Way Bucketed Cuckoo Hash Tables}},
booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
pages = {12:1--12:20},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-243-3},
ISSN = {1868-8969},
year = {2022},
volume = {242},
editor = {Boucher, Christina and Rahmann, Sven},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2022/17046},
URN = {urn:nbn:de:0030-drops-170467},
doi = {10.4230/LIPIcs.WABI.2022.12},
annote = {Keywords: gapped k-mer, k-mer, counting, Cuckoo hashing, parallelization}
}