License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2022.25
URN: urn:nbn:de:0030-drops-170598
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2022/17059/
Go to the corresponding LIPIcs Volume Portal


Agret, Clément ; Cazaux, Bastien ; Limasset, Antoine

Toward Optimal Fingerprint Indexing for Large Scale Genomics

pdf-format:
LIPIcs-WABI-2022-25.pdf (1 MB)


Abstract

Motivation. To keep up with the scale of genomic databases, several methods rely on local sensitive hashing methods to efficiently find potential matches within large genome collections. Existing solutions rely on Minhash or Hyperloglog fingerprints and require reading the whole index to perform a query. Such solutions can not be considered scalable with the growing amount of documents to index.
Results. We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming state-of-the-art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of Hyperminhash fingerprints in (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure able to index any kind of fingerprints based on inverted indexes that provide optimal queries, namely linear with the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than state-of-the-art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.

BibTeX - Entry

@InProceedings{agret_et_al:LIPIcs.WABI.2022.25,
  author =	{Agret, Cl\'{e}ment and Cazaux, Bastien and Limasset, Antoine},
  title =	{{Toward Optimal Fingerprint Indexing for Large Scale Genomics}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{25:1--25:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2022/17059},
  URN =		{urn:nbn:de:0030-drops-170598},
  doi =		{10.4230/LIPIcs.WABI.2022.25},
  annote =	{Keywords: Data Structure, Indexation, Local Sensitive Hashing, Genomes, Databases}
}

Keywords: Data Structure, Indexation, Local Sensitive Hashing, Genomes, Databases
Collection: 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)
Issue Date: 2022
Date of publication: 26.08.2022
Supplementary Material: We wrote the NIQKI index as an open-source C++ library under the AGPL3 license. It is designed as a user-friendly tool and comes along with usage samples.
Software (Source Code): https://github.com/Malfoy/NIQKI archived at: https://archive.softwareheritage.org/swh:1:dir:4b130954e11adff2be9108f45c4181f972604407


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI