License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/OASIcs.SLATE.2018.16
URN: urn:nbn:de:0030-drops-92747

Canosa, Afonso Xavier

Comparison of Segmentable Units as Indicators of Two Texts Being Parallel (Short Paper)

A bitext built from a Portuguese historical text, Fernão Mendes Pinto's Pilgrimage, and its English translation serves as a case study for describing the creation of a parallel corpus and for investigating which linguistic and textual units best indicate alignability. Building the corpus involves preparing transcriptions, annotation, segmentation and sentence alignment. Once the bitext is ready, the corpus is used to examine which units are most relevant for predicting that the two texts are parallel. The units compared range from the largest content units (chapters) through sentences, word types and tokens down to characters. Characters, despite being the unit with the least textual and linguistic significance, were found to be the best indicator that the two texts are alignable.
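
The comparison described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `unit_counts` and `unit_ratios`, the regex-based sentence splitter and tokenizer, and the use of a target-to-source count ratio as the measure are all assumptions made for illustration.

```python
import re


def unit_counts(text):
    """Count naive segmentable units in one text (illustrative only)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a token is a maximal run of word characters, lowercased.
    tokens = re.findall(r"\w+", text.lower())
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "characters": len(text),
    }


def unit_ratios(source, target):
    """Return target/source count ratios for each unit present in the source."""
    src, tgt = unit_counts(source), unit_counts(target)
    return {unit: tgt[unit] / src[unit] for unit in src if src[unit]}


if __name__ == "__main__":
    # Toy sentence pair, not taken from the corpus.
    portuguese = "O navio partiu. Chegou tarde."
    english = "The ship left. It arrived late."
    for unit, ratio in unit_ratios(portuguese, english).items():
        print(f"{unit}: {ratio:.2f}")
```

Under these assumptions, a unit whose ratio stays stable across aligned segments of a bitext would be a candidate indicator that the two texts are parallel.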

BibTeX - Entry

  author =	{Afonso Xavier Canosa},
  title =	{{Comparison of Segmentable Units as Indicators of Two Texts Being Parallel (Short Paper)}},
  booktitle =	{7th Symposium on Languages, Applications and Technologies (SLATE 2018)},
  pages =	{16:1--16:7},
  series =	{OpenAccess Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-072-9},
  ISSN =	{2190-6807},
  year =	{2018},
  volume =	{62},
  editor =	{Pedro Rangel Henriques and Jos{\'e} Paulo Leal and Ant{\'o}nio Menezes Leit{\~a}o and Xavier G{\'o}mez Guinovart},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-92747},
  doi =		{10.4230/OASIcs.SLATE.2018.16},
  annote =	{Keywords: parallel corpora, text alignment, bitexts}

Keywords: parallel corpora, text alignment, bitexts
Collection: 7th Symposium on Languages, Applications and Technologies (SLATE 2018)
Issue Date: 2018
Date of publication: 13.07.2018
