License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/DagSemProc.06491.14
URN: urn:nbn:de:0030-drops-10533
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2007/1053/
Pilz, Thomas
Searching in text databases with non-standard orthography
Abstract
In this paper we present research results of the recent project "Rule based search in text data bases with non-standard orthography". Numerous steps lead from a facsimile to a searchable text document. This paper focuses on techniques that ensure better retrieval results on historical texts with non-standard spellings. Historical documents, especially those set in black letter fonts, are prone to recognition errors. Adequate preparation of the image sources prior to OCR can reduce the number of misrecognized characters. Furthermore, placing a search engine with categorized distance measures between the user interface and the text database can help to improve retrieval results. Specific metrics cover problems in optical character recognition, transcription, and historical spelling variation. With a synoptic view interface, users need not be aware of the methods applied to their queries.
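As a rough illustration of what a distance measure tuned to OCR errors and spelling variation might look like, the following minimal sketch implements a weighted edit distance in Python. It is not the project's actual metric: the function names and the table `CHEAP_SUBSTITUTIONS`, including every character pair and cost in it, are assumptions chosen only for demonstration.

```python
from typing import Dict, Tuple

# Hypothetical reduced substitution costs (illustrative values, not taken
# from the paper). Pairs not listed here cost 1.0, as in plain Levenshtein.
CHEAP_SUBSTITUTIONS: Dict[Tuple[str, str], float] = {
    ("f", "s"): 0.2,  # assumed OCR confusion (long s read as f)
    ("u", "n"): 0.3,  # assumed OCR confusion in black letter fonts
    ("y", "i"): 0.1,  # assumed historical spelling variation
    ("v", "u"): 0.1,  # assumed historical spelling variation
}


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with b; the table is treated as symmetric."""
    if a == b:
        return 0.0
    return CHEAP_SUBSTITUTIONS.get((a, b), CHEAP_SUBSTITUTIONS.get((b, a), 1.0))


def weighted_distance(query: str, candidate: str) -> float:
    """Edit distance with unit insert/delete costs and weighted substitutions."""
    # prev[j] holds the distance between the processed query prefix
    # and the first j characters of the candidate.
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, qc in enumerate(query, start=1):
        curr = [float(i)]
        for j, cc in enumerate(candidate, start=1):
            curr.append(min(
                prev[j] + 1.0,                           # delete qc
                curr[j - 1] + 1.0,                       # insert cc
                prev[j - 1] + substitution_cost(qc, cc)  # substitute or match
            ))
        prev = curr
    return prev[-1]
```

Under these assumed weights, a query such as `bei` lies much closer to the variant `bey` than to an unrelated word of the same length, so a search engine could rank such variants ahead of other near matches.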
BibTeX - Entry
@InProceedings{pilz:DagSemProc.06491.14,
author = {Pilz, Thomas},
title = {{Searching in text databases with non-standard orthography}},
booktitle = {Digital Historical Corpora - Architecture, Annotation, and Retrieval},
pages = {1--2},
series = {Dagstuhl Seminar Proceedings (DagSemProc)},
ISSN = {1862-4405},
year = {2007},
volume = {6491},
editor = {Lou Burnard and Milena Dobreva and Norbert Fuhr and Anke L\"{u}deling},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2007/1053},
URN = {urn:nbn:de:0030-drops-10533},
doi = {10.4230/DagSemProc.06491.14},
annote = {Keywords: Rule based search, Optical character recognition, spelling variation, edit distance}
}
Keywords: Rule based search, Optical character recognition, spelling variation, edit distance
Collection: 06491 - Digital Historical Corpora - Architecture, Annotation, and Retrieval
Issue Date: 2007
Date of publication: 13.06.2007