License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/OASIcs.SLATE.2018.16
URN: urn:nbn:de:0030-drops-92747

Canosa, Afonso Xavier

Comparison of Segmentable Units as Indicators of Two Texts Being Parallel (Short Paper)



A bitext built from a Portuguese historical text, Fernão Mendes Pinto's Pilgrimage, and its English translation serves as a case study to describe the creation of a parallel corpus and to investigate which linguistic and textual units are the best indicators of alignability. Building the corpus involves preparing the transcriptions, annotation, segmentation and sentence alignment. Once the bitext is ready, the corpus is used to determine which units are most relevant for predicting that the two texts are parallel. The units compared range from the largest content units, chapters, down to sentences, word types, tokens and characters. Characters, despite being the unit with the least textual and linguistic significance, were found to be the best indicator that the two texts are alignable.
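As a rough illustration of the comparison the abstract describes, the sketch below counts sentences, tokens, word types and characters per chapter on both sides of a bitext and correlates the counts across chapters. It is not the paper's procedure: the abstract does not say how the indicators were measured, so the regex-based sentence splitter and tokenizer, the choice of Pearson correlation, and the names `unit_counts`, `pearson` and `unit_correlations` are all assumptions made for this example.

```python
import math
import re


def unit_counts(text):
    """Count segmentable units in one chapter (naive, regex-based assumption)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens = re.findall(r"\w+", text.lower())
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "characters": len(text),
    }


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either series has no variance."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def unit_correlations(source_chapters, target_chapters):
    """Correlate per-chapter counts of each unit between the two texts.

    Assumes the chapters are already paired one-to-one, in order.
    """
    src = [unit_counts(c) for c in source_chapters]
    tgt = [unit_counts(c) for c in target_chapters]
    return {
        unit: pearson([c[unit] for c in src], [c[unit] for c in tgt])
        for unit in src[0]
    }
```

Under these assumptions, a unit whose per-chapter counts correlate strongly across the two languages would be read as a better indicator that the texts are parallel. Real corpora would need a proper tokenizer and sentence splitter for each language.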

Collection: 7th Symposium on Languages, Applications and Technologies (SLATE 2018)
Issue Date: 2018
Date of publication: 13.07.2018

