Abstract
Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph (dBG) of some order k. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper we propose rBOSS, a new data structure based on the BurrowsWheeler Transform (BWT), which gets close to that ideal. Our rBOSS simultaneously encodes all the dBGs of a set of sequencing reads up to some order k, and for any dBG node v, it can compute in O(k) time all the other nodes whose labels have an overlap of at least m characters with the label of v, with m being a parameter. If we choose the parameter k equal to the size of the reads (assuming that all have equal length), then we can simulate the overlap graph of the read set. Instead of storing the edges of this graph explicitly, rBOSS computes them on the fly as we traverse the graph. As most BWTbased structures, rBOSS is unidirectional, meaning that we can retrieve only the suffix overlaps of the nodes. However, we exploit the property of the DNA reverse complements to simulate bidirectionality. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. The experimental results show that, using k=100, our rBOSSbased assembler can process ~500K reads of 150 characters long each (a FASTQ file of 185 MB) in less than 15 minutes and using 110 MB in total. It produces contigs of mean sizes over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.
BibTeX  Entry
@InProceedings{dazdomnguez_et_al:LIPIcs:2019:10497,
author = {Diego D{\'i}azDom{\'i}nguez and Travis Gagie and Gonzalo Navarro},
title = {{Simulating the DNA Overlap Graph in Succinct Space}},
booktitle = {30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
pages = {26:126:20},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959771030},
ISSN = {18688969},
year = {2019},
volume = {128},
editor = {Nadia Pisanti and Solon P. Pissis},
publisher = {Schloss DagstuhlLeibnizZentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2019/10497},
URN = {urn:nbn:de:0030drops104978},
doi = {10.4230/LIPIcs.CPM.2019.26},
annote = {Keywords: Overlap graph, de Bruijn graph, DNA sequencing, Succinct ordinal trees}
}
Keywords: 

Overlap graph, de Bruijn graph, DNA sequencing, Succinct ordinal trees 
Collection: 

30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019) 
Issue Date: 

2019 
Date of publication: 

06.06.2019 