License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2022.25
URN: urn:nbn:de:0030-drops-170598
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2022/17059/
Agret, Clément ;
Cazaux, Bastien ;
Limasset, Antoine
Toward Optimal Fingerprint Indexing for Large Scale Genomics
Abstract
Motivation. To keep up with the scale of genomic databases, several methods rely on local sensitive hashing methods to efficiently find potential matches within large genome collections. Existing solutions rely on Minhash or Hyperloglog fingerprints and require reading the whole index to perform a query. Such solutions can not be considered scalable with the growing amount of documents to index.
Results. We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming state-of-the-art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of Hyperminhash fingerprints in (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure able to index any kind of fingerprints based on inverted indexes that provide optimal queries, namely linear with the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than state-of-the-art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.
BibTeX - Entry
@InProceedings{agret_et_al:LIPIcs.WABI.2022.25,
author = {Agret, Cl\'{e}ment and Cazaux, Bastien and Limasset, Antoine},
title = {{Toward Optimal Fingerprint Indexing for Large Scale Genomics}},
booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
pages = {25:1--25:15},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-243-3},
ISSN = {1868-8969},
year = {2022},
volume = {242},
editor = {Boucher, Christina and Rahmann, Sven},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2022/17059},
URN = {urn:nbn:de:0030-drops-170598},
doi = {10.4230/LIPIcs.WABI.2022.25},
annote = {Keywords: Data Structure, Indexation, Local Sensitive Hashing, Genomes, Databases}
}