License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2019.26
URN: urn:nbn:de:0030-drops-104978
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2019/10497/
Go to the corresponding LIPIcs Volume Portal


Díaz-Domínguez, Diego ; Gagie, Travis ; Navarro, Gonzalo

Simulating the DNA Overlap Graph in Succinct Space

pdf-format:
LIPIcs-CPM-2019-26.pdf (4 MB)


Abstract

Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph (dBG) of some order k. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper we propose rBOSS, a new data structure based on the Burrows-Wheeler Transform (BWT), which gets close to that ideal. Our rBOSS simultaneously encodes all the dBGs of a set of sequencing reads up to some order k, and for any dBG node v, it can compute in O(k) time all the other nodes whose labels have an overlap of at least m characters with the label of v, with m being a parameter. If we choose the parameter k equal to the size of the reads (assuming that all have equal length), then we can simulate the overlap graph of the read set. Instead of storing the edges of this graph explicitly, rBOSS computes them on the fly as we traverse the graph. As most BWT-based structures, rBOSS is unidirectional, meaning that we can retrieve only the suffix overlaps of the nodes. However, we exploit the property of the DNA reverse complements to simulate bi-directionality. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. The experimental results show that, using k=100, our rBOSS-based assembler can process ~500K reads of 150 characters long each (a FASTQ file of 185 MB) in less than 15 minutes and using 110 MB in total. It produces contigs of mean sizes over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.

BibTeX - Entry

@InProceedings{dazdomnguez_et_al:LIPIcs:2019:10497,
  author =	{Diego D{\'i}az-Dom{\'i}nguez and Travis Gagie and Gonzalo Navarro},
  title =	{{Simulating the DNA Overlap Graph in Succinct Space}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{26:1--26:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Nadia Pisanti and Solon P. Pissis},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{http://drops.dagstuhl.de/opus/volltexte/2019/10497},
  URN =		{urn:nbn:de:0030-drops-104978},
  doi =		{10.4230/LIPIcs.CPM.2019.26},
  annote =	{Keywords: Overlap graph, de Bruijn graph, DNA sequencing, Succinct ordinal trees}
}

Keywords: Overlap graph, de Bruijn graph, DNA sequencing, Succinct ordinal trees
Collection: 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)
Issue Date: 2019
Date of publication: 06.06.2019


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI