License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2023.18
URN: urn:nbn:de:0030-drops-186446
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2023/18644/


Fan, Jason ; Singh, Noor Pratap ; Khan, Jamshed ; Pibiri, Giulio Ermanno ; Patro, Rob

Fulgor: A Fast and Compact k-mer Index for Large-Scale Matching and Color Queries



Abstract

The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections.
To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space.
We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6× faster to construct.
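
The unitig-to-color mapping described in the abstract can be illustrated with a small sketch. It is not Fulgor's actual code. It assumes that unitigs are numbered so that all unitigs sharing a color are consecutive, and that a bit vector marks the last unitig of each such run. The color of unitig i is then the number of set bits before position i (a rank query). The function names and the toy data are hypothetical, and a plain prefix-sum array stands in for the succinct rank structure that a real index would need to reach the 1+o(1) bits per unitig bound.

```python
from itertools import accumulate


def build_boundaries(unitig_colors):
    """Return one bit per unitig: 1 iff the unitig is the last of its color run.

    Assumption: `unitig_colors` lists a color id per unitig, with equal ids
    grouped consecutively (unitigs sorted by color).
    """
    n = len(unitig_colors)
    return [
        1 if i + 1 == n or unitig_colors[i] != unitig_colors[i + 1] else 0
        for i in range(n)
    ]


def build_rank(bits):
    """prefix[i] = number of 1 bits in bits[0:i].

    A stand-in for a succinct rank structure; this array itself is not compact.
    """
    return [0] + list(accumulate(bits))


def color_id(prefix, unitig_id):
    """Color id of a unitig = number of color runs completed before it."""
    return prefix[unitig_id]


# Toy example: six unitigs, three distinct colors (sets of reference ids).
color_sets = [[0, 2], [1], [0, 1, 2]]      # hypothetical inverted lists
unitig_colors = [0, 0, 0, 1, 2, 2]         # color id per unitig, grouped
bits = build_boundaries(unitig_colors)
prefix = build_rank(bits)
recovered = [color_id(prefix, u) for u in range(len(unitig_colors))]
```

In this toy setting, looking up the references of a k-mer would amount to finding the unitig that contains it (the job of the order-preserving dictionary), calling `color_id` on that unitig, and reading `color_sets` at the returned position. Only one bit per unitig is stored beyond the rank support.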

BibTeX - Entry

@InProceedings{fan_et_al:LIPIcs.WABI.2023.18,
  author =	{Fan, Jason and Singh, Noor Pratap and Khan, Jamshed and Pibiri, Giulio Ermanno and Patro, Rob},
  title =	{{Fulgor: A Fast and Compact \{k-mer\} Index for Large-Scale Matching and Color Queries}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{18:1--18:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2023/18644},
  URN =		{urn:nbn:de:0030-drops-186446},
  doi =		{10.4230/LIPIcs.WABI.2023.18},
  annote =	{Keywords: k-mers, Colored de Bruijn Graph, Compression, Read-mapping}
}

Keywords: k-mers, Colored de Bruijn Graph, Compression, Read-mapping
Collection: 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)
Issue Date: 2023
Date of publication: 29.08.2023
Supplementary Material: Software (Source Code): https://github.com/jermp/fulgor
Software (Source Code): https://github.com/jermp/fulgor-benchmarks

