License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/OASIcs.LDK.2021.23
URN: urn:nbn:de:0030-drops-145597
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2021/14559/
Manjunath, Sampritha H.; McCrae, John P.
Encoder-Attention-Based Automatic Term Recognition (EA-ATR)
Abstract
Automated Term Recognition (ATR) is the task of finding terminology in raw text. It involves designing and developing techniques for mining candidate terms from the text, filtering these candidates by scores computed with measures such as frequency of occurrence, and then ranking them. Current approaches often rely on statistics and regular expressions over part-of-speech tags to identify terms, but this is error-prone. We propose a deep learning technique to improve the identification of candidate term sequences. We improve term recognition by using embeddings based on Bidirectional Encoder Representations from Transformers (BERT) to identify which sequences of words are terms. The model is trained on Wikipedia titles: we take all Wikipedia titles as the positive set and random n-grams generated from the raw text as a weak negative set. Training on the positive and negative sets uses the Embed, Encode, Attend and Predict (EEAP) formulation with BERT embeddings. The model is then evaluated against domain-specific corpora such as GENIA (annotated biological terms) and Krapivin (scientific papers from the computer science domain).
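The construction of the training data described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenization, the maximum n-gram length of 4, and the step that drops sampled n-grams matching a known title are all assumptions made for illustration. Titles are labeled 1 (positive) and random n-grams drawn from the raw text are labeled 0 (weak negative).

```python
import random


def sample_negative_ngrams(tokens, positives, n_samples, max_n=4, seed=0):
    """Draw distinct random n-grams from `tokens` as weak negatives.

    Assumption: n-grams that exactly match a positive title
    (case-insensitively) are skipped; the abstract does not state this.
    """
    if not tokens:
        return []
    rng = random.Random(seed)
    positive_set = {p.lower() for p in positives}
    negatives = set()
    attempts = 0
    max_attempts = n_samples * 50  # avoid looping forever on tiny texts
    while len(negatives) < n_samples and attempts < max_attempts:
        attempts += 1
        n = rng.randint(1, min(max_n, len(tokens)))   # n-gram length
        start = rng.randint(0, len(tokens) - n)       # start offset
        ngram = " ".join(tokens[start:start + n])
        if ngram.lower() not in positive_set:
            negatives.add(ngram)
    return sorted(negatives)


def build_training_pairs(titles, text, n_negatives, seed=0):
    """Return (phrase, label) pairs: titles -> 1, random n-grams -> 0."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    negatives = sample_negative_ngrams(tokens, titles, n_negatives, seed=seed)
    return [(t, 1) for t in titles] + [(n, 0) for n in negatives]


if __name__ == "__main__":
    titles = ["machine learning", "neural network"]
    text = "we train a neural network with machine learning on raw text data"
    for phrase, label in build_training_pairs(titles, text, n_negatives=5):
        print(label, phrase)
```

The negative set is "weak" because a random n-gram may still be a valid term that simply has no Wikipedia title; the sketch makes no attempt to detect such cases.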
BibTeX - Entry
@InProceedings{manjunath_et_al:OASIcs.LDK.2021.23,
author = {Manjunath, Sampritha H. and McCrae, John P.},
title = {{Encoder-Attention-Based Automatic Term Recognition (EA-ATR)}},
booktitle = {3rd Conference on Language, Data and Knowledge (LDK 2021)},
pages = {23:1--23:13},
series = {Open Access Series in Informatics (OASIcs)},
ISBN = {978-3-95977-199-3},
ISSN = {2190-6807},
year = {2021},
volume = {93},
editor = {Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2021/14559},
URN = {urn:nbn:de:0030-drops-145597},
doi = {10.4230/OASIcs.LDK.2021.23},
annote = {Keywords: Automatic Term Recognition, Term Extraction, BERT, EEAP, Deep Learning for ATR}
}
Keywords: Automatic Term Recognition, Term Extraction, BERT, EEAP, Deep Learning for ATR
Collection: 3rd Conference on Language, Data and Knowledge (LDK 2021)
Issue Date: 2021
Date of publication: 30.08.2021