License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/OASIcs.LDK.2021.35
URN: urn:nbn:de:0030-drops-145714
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2021/14571/


Picca, Davide ; Gay-Crosier, Cyrille

An Automatic Partitioning of Gutenberg.org Texts


Abstract

Over the last 10 years, the automatic partitioning of texts has attracted growing interest in the community. Automatically identifying the parts of a text can make textual analysis faster and easier. We introduce here exploratory work on multi-part book identification. As a first attempt, we focus on Gutenberg.org, one of the projects that has received the largest public support in recent years. The purpose of this article is to present a preliminary system that automatically classifies parts of texts into 35 semantic categories. The system achieved an accuracy of more than 93% on the test set. We plan to extend this effort to other repositories in the future.

BibTeX - Entry

@InProceedings{picca_et_al:OASIcs.LDK.2021.35,
  author =	{Picca, Davide and Gay-Crosier, Cyrille},
  title =	{{An Automatic Partitioning of Gutenberg.org Texts}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{35:1--35:9},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2021/14571},
  URN =		{urn:nbn:de:0030-drops-145714},
  doi =		{10.4230/OASIcs.LDK.2021.35},
  annote =	{Keywords: Digital Humanities, Machine Learning, Corpora}
}

Keywords: Digital Humanities, Machine Learning, Corpora
Collection: 3rd Conference on Language, Data and Knowledge (LDK 2021)
Issue Date: 2021
Date of publication: 30.08.2021


Published by LZI