License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2022.10
URN: urn:nbn:de:0030-drops-170446
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2022/17044/
Authors: Suzuki, Yoshihiko; Myers, Gene
Accurate k-mer Classification Using Read Profiles
Abstract
Contiguous strings of length k, called k-mers, are a fundamental element in many bioinformatics tasks. The number of occurrences of a k-mer in a given set of DNA sequencing reads, its k-mer count, has often been used to roughly estimate the copy number of a k-mer in the genome from which the reads were sampled. The problem of estimating copy numbers, called here the k-mer classification problem, has conventionally been addressed by simply analyzing the histogram of counts of all the k-mers in a dataset, thus ignoring the positional context and dependency between multiple k-mers that appear nearby in the underlying genome. Here we present an efficient and significantly more accurate method for classifying k-mers by analyzing the sequence of k-mer counts along each sequencing read, called a read profile. By analyzing read profiles, we explicitly incorporate into the model the dependencies between positionally adjacent k-mers and the sequence context-dependent error rates estimated from the given dataset. For long sequencing reads produced with the accurate high-fidelity (HiFi) sequencing technology, an implementation of our method, ClassPro, outperforms the conventional, histogram-based method in every simulation dataset of fruit fly and human with various realistic values of sequencing coverage and heterozygosity. Within only a few minutes, ClassPro achieves an average accuracy of > 99.99% across reads without repetitive k-mers and > 99.5% across all reads, in a typical fruit fly simulation dataset with a 40× coverage. The more accurate k-mer classifications produced by ClassPro are in principle expected to improve any k-mer-based downstream analyses for sequenced reads, such as read mapping and overlap, spectral alignment and error correction, haplotype phasing, and trio binning, to name but a few. ClassPro is available at https://github.com/yoshihikosuzuki/ClassPro.
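To make the terms in the abstract concrete, the following is a minimal Python sketch of a k-mer count table and a read profile, i.e. the sequence of k-mer counts along one read. It is an illustration only and not ClassPro's implementation: the function names `count_kmers` and `read_profile`, the toy reads, and the choice k = 3 are all assumptions made for this example, and reverse-complement (canonical) k-mers are deliberately ignored for brevity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def read_profile(read, counts, k):
    """Return the k-mer counts along one read, in positional order."""
    return [counts[read[i:i + k]] for i in range(len(read) - k + 1)]


# Toy data (assumed for illustration): the third read differs from the
# repeated ACGT pattern by a single base.
reads = ["ACGTACGT", "CGTACGTA", "ACGTTCGT"]
K = 3

counts = count_kmers(reads, K)
profile = read_profile(reads[2], counts, K)
print(profile)  # [4, 6, 1, 1, 1, 6]
```

In this toy profile, the run of low counts in the middle marks the k-mers that overlap the deviating base. That kind of positional signal is visible in a read profile but lost in a global histogram of counts, which is the contrast the abstract draws.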
BibTeX - Entry
@InProceedings{suzuki_et_al:LIPIcs.WABI.2022.10,
author = {Suzuki, Yoshihiko and Myers, Gene},
title = {{Accurate k-mer Classification Using Read Profiles}},
booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
pages = {10:1--10:20},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-243-3},
ISSN = {1868-8969},
year = {2022},
volume = {242},
editor = {Boucher, Christina and Rahmann, Sven},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2022/17044},
URN = {urn:nbn:de:0030-drops-170446},
doi = {10.4230/LIPIcs.WABI.2022.10},
annote = {Keywords: K-mer, K-mer count, K-mer classification, HiFi sequencing}
}
Keywords: K-mer, K-mer count, K-mer classification, HiFi sequencing
Collection: 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)
Issue Date: 2022
Date of publication: 26.08.2022