License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2022.10
URN: urn:nbn:de:0030-drops-170446
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2022/17044/
Go to the corresponding LIPIcs Volume Portal


Suzuki, Yoshihiko ; Myers, Gene

Accurate k-mer Classification Using Read Profiles

pdf-format:
LIPIcs-WABI-2022-10.pdf (3 MB)


Abstract

Contiguous strings of length k, called k-mers, are a fundamental element in many bioinformatics tasks. The number of occurrences of a k-mer in a given set of DNA sequencing reads, its k-mer count, has often been used to roughly estimate the copy number of a k-mer in the genome from which the reads were sampled. The problem of estimating copy numbers, called here the k-mer classification problem, has been based on simply analyzing the histogram of counts of all the k-mers in a data set, thus ignoring the positional context and dependency between multiple k-mers that appear nearby in the underlying genome. Here we present an efficient and significantly more accurate method for classifying k-mers by analyzing the sequence of k-mer counts along each sequencing read, called a read profile. By analyzing read profiles, we explicitly incorporate into the model the dependencies between the positionally adjacent k-mers and the sequence context-dependent error rates estimated from the given dataset. For long sequencing reads produced with the accurate high-fidelity (HiFi) sequencing technology, an implementation of our method, ClassPro, outperforms the conventional, histogram-based method in every simulation dataset of fruit fly and human with various realistic values of sequencing coverage and heterozygosity. Within only a few minutes, ClassPro achieves an average accuracy of > 99.99% across reads without repetitive k-mers and > 99.5% across all reads, in a typical fruit fly simulation data set with a 40× coverage. The resulting, more accurate k-mer classifications by ClassPro are in principle expected to improve any k-mer-based downstream analyses for sequenced reads such as read mapping and overlap, spectral alignment and error correction, haplotype phasing, and trio binning to name but a few. ClassPro is available at https://github.com/yoshihikosuzuki/ClassPro.

BibTeX - Entry

@InProceedings{suzuki_et_al:LIPIcs.WABI.2022.10,
  author =	{Suzuki, Yoshihiko and Myers, Gene},
  title =	{{Accurate k-mer Classification Using Read Profiles}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{10:1--10:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2022/17044},
  URN =		{urn:nbn:de:0030-drops-170446},
  doi =		{10.4230/LIPIcs.WABI.2022.10},
  annote =	{Keywords: K-mer, K-mer count, K-mer classification, HiFi sequencing}
}

Keywords: K-mer, K-mer count, K-mer classification, HiFi sequencing
Collection: 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)
Issue Date: 2022
Date of publication: 26.08.2022


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI