License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.SEA.2021.12
URN: urn:nbn:de:0030-drops-137848
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2021/13784/
Go to the corresponding LIPIcs Volume Portal


Puglisi, Simon J. ; Zhukova, Bella

Document Retrieval Hacks

pdf-format:
LIPIcs-SEA-2021-12.pdf (2 MB)


Abstract

Given a collection of strings, document listing refers to the problem of finding all the strings (or documents) where a given query string (or pattern) appears. Index data structures that support efficient document listing for string collections have been the focus of intense research in the last decade, with dozens of papers published describing exotic and elegant compressed data structures. The problem is now quite well understood in theory and many of the solutions have been implemented and evaluated experimentally. A particular recent focus has been on highly repetitive document collections, which have become prevalent in many areas (such as version control systems and genomics - to name just two very different sources).
The aim of this paper is to describe simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured. Our approaches are based on simple combinations of scanning and hashing, which we show to combine very well with dictionary compression to achieve small space usage. Our experiments show these methods to be often much faster and less space consuming than the best specialized indexes for the problem.

BibTeX - Entry

@InProceedings{puglisi_et_al:LIPIcs.SEA.2021.12,
  author =	{Puglisi, Simon J. and Zhukova, Bella},
  title =	{{Document Retrieval Hacks}},
  booktitle =	{19th International Symposium on Experimental Algorithms (SEA 2021)},
  pages =	{12:1--12:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-185-6},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{190},
  editor =	{Coudert, David and Natale, Emanuele},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2021/13784},
  URN =		{urn:nbn:de:0030-drops-137848},
  doi =		{10.4230/LIPIcs.SEA.2021.12},
  annote =	{Keywords: String Processing, Pattern matching, Document listing, Document retrieval, Succinct data structures, Repetitive text collections}
}

Keywords: String Processing, Pattern matching, Document listing, Document retrieval, Succinct data structures, Repetitive text collections
Collection: 19th International Symposium on Experimental Algorithms (SEA 2021)
Issue Date: 2021
Date of publication: 31.05.2021


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI