License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/DagSemProc.10231.4
URN: urn:nbn:de:0030-drops-27384

Pizzi, Cinzia

Efficient computation of statistics for words with mismatches



Since the early stages of bioinformatics, substrings have played a crucial role in the search for and discovery of significant biological signals. Despite the advent of a large number of different approaches and models to accomplish these tasks, substrings continue to be widely used to determine statistical distributions and compositions of biological sequences at various levels of detail.
Here we overview efficient algorithms that were recently proposed to compute the actual and the expected frequency of words with k mismatches, under the assumption that the words of interest occur at least once exactly in the sequence under analysis. Efficiency here means that these algorithms are polynomial in k, rather than exponential as with an enumerative approach, and independent of the length of the query word.
These algorithms are all based on a common incremental approach: a preprocessing step that makes it possible to efficiently answer queries about any word occurring in the text. The same approach can be used with a sliding-window scan of the sequence to compute the same statistics for words of fixed length, even more efficiently.
The efficient computation of both the expected and the actual frequency of substrings, combined with a study of the monotonicity of popular scores such as z-scores, makes it possible to build tables of feasible size in reasonable time, and can therefore be used in practical applications.
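
To make the quantities in the abstract concrete, the following is a minimal sketch of a naive baseline, not the algorithms surveyed in the paper. It assumes an i.i.d. (Bernoulli) sequence model with symbol probabilities estimated from the text, and it approximates the variance binomially, ignoring the dependence between overlapping windows; both are illustrative assumptions. The function names `observed_count`, `match_probability`, and `z_score` are hypothetical. The observed count is obtained by a plain sliding-window Hamming scan, and the match probability by a small dynamic program over the number of mismatches, which takes O(mk) time for a word of length m.

```python
from collections import Counter
from math import sqrt


def observed_count(text: str, word: str, k: int) -> int:
    """Count windows of len(word) in text within Hamming distance k of word."""
    m = len(word)
    count = 0
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], word):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # too many mismatches: reject this window
        else:
            count += 1  # inner loop finished without exceeding k
    return count


def match_probability(word: str, k: int, probs: dict) -> float:
    """Probability that a random i.i.d. window has at most k mismatches with word.

    Assumption: each symbol is drawn independently with probability probs[symbol].
    """
    # dist[j] = probability of exactly j mismatches so far (only j <= k is tracked).
    dist = [1.0] + [0.0] * k
    for symbol in word:
        p = probs[symbol]  # probability that this position matches
        new = [0.0] * (k + 1)
        for j in range(k + 1):
            new[j] += dist[j] * p
            if j + 1 <= k:
                new[j + 1] += dist[j] * (1.0 - p)
        dist = new
    return sum(dist)


def z_score(text: str, word: str, k: int):
    """Return (observed, expected, z) for word with up to k mismatches in text."""
    n, m = len(text), len(word)
    counts = Counter(text)
    probs = {c: counts[c] / n for c in counts}  # estimated symbol frequencies
    positions = n - m + 1
    p = match_probability(word, k, probs)
    expected = positions * p
    # Assumption: binomial variance, ignoring overlaps between windows.
    variance = positions * p * (1.0 - p)
    observed = observed_count(text, word, k)
    z = (observed - expected) / sqrt(variance) if variance > 0 else 0.0
    return observed, expected, z
```

For example, `z_score("ACGTACGTAC", "ACG", 1)` returns the observed count, the expected count under the assumed model, and the resulting approximate z-score. The sketch relies on the word occurring in the text, as the abstract assumes, so that every symbol of the word has an estimated probability.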

BibTeX - Entry

  author =	{Pizzi, Cinzia},
  title =	{{Efficient computation of statistics for words with mismatches}},
  booktitle =	{Structure Discovery in Biology: Motifs, Networks \& Phylogenies},
  pages =	{1--22},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2010},
  volume =	{10231},
  editor =	{Alberto Apostolico and Andreas Dress and Laxmi Parida},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-27384},
  doi =		{10.4230/DagSemProc.10231.4},
  annote =	{Keywords: Statistics on words, mismatches, dynamic programming, biological sequences.}

Keywords: Statistics on words, mismatches, dynamic programming, biological sequences.
Collection: 10231 - Structure Discovery in Biology: Motifs, Networks & Phylogenies
Issue Date: 2010
Date of publication: 23.08.2010
