License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/DagSemProc.08261.10
URN: urn:nbn:de:0030-drops-16743
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2008/1674/
Go to the corresponding Portal |
Mäkinen, Veli ;
Navarro, Gonzalo ;
Sirén, Jouni ;
Välimäki, Niko
Storage and Retrieval of Individual Genomes
Abstract
A repetitive sequence collection is one where portions of a emph{base sequence} of length $n$ are repeated many times with small variations, forming a collection of total length $N$. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies $O(N log N)$ bits, which very soon inhibits
in-memory analyses. Recent advances in full-text emph{self-indexing} reduce the space of suffix tree to $O(N log sigma)$ bits, where $sigma$ is the alphabet size. In practice, the space reduction is more than $10$-fold for example on suffix tree of Human Genome. However, this reduction remains a constant factor when more sequences are added to the collection
We develop a new self-index suited for the repetitive sequence collection setting. Its expected space requirement depends only on the length $n$ of the base sequence and the number $s$ of variations in its repeated copies. That is, the space reduction is no longer constant, but depends on $N/n$.
We believe the structure developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.
BibTeX - Entry
@InProceedings{makinen_et_al:DagSemProc.08261.10,
author = {M\"{a}kinen, Veli and Navarro, Gonzalo and Sir\'{e}n, Jouni and V\"{a}lim\"{a}ki, Niko},
title = {{Storage and Retrieval of Individual Genomes}},
booktitle = {Structure-Based Compression of Complex Massive Data},
pages = {1--14},
series = {Dagstuhl Seminar Proceedings (DagSemProc)},
ISSN = {1862-4405},
year = {2008},
volume = {8261},
editor = {Stefan B\"{o}ttcher and Markus Lohrey and Sebastian Maneth and Wojcieh Rytter},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2008/1674},
URN = {urn:nbn:de:0030-drops-16743},
doi = {10.4230/DagSemProc.08261.10},
annote = {Keywords: Pattern matching, text indexing, compressed data structures, comparative genomics}
}
Keywords: |
|
Pattern matching, text indexing, compressed data structures, comparative genomics |
Collection: |
|
08261 - Structure-Based Compression of Complex Massive Data |
Issue Date: |
|
2008 |
Date of publication: |
|
20.11.2008 |