The ACL RD-TEC: A Dataset for Benchmarking Terminology Extraction and Classification in Computational Linguistics

QasemiZadeh, Behrang; Handschuh, Siegfried

Date

2014

Recommended Citation

QasemiZadeh, Behrang; Handschuh, Siegfried (2014). The ACL RD-TEC: A Dataset for Benchmarking Terminology Extraction and Classification in Computational Linguistics. In: Patrick Drouin, Natalia Grabar, Thierry Hamon, and Kyo Kageura (eds.), COLING 2014: 4th International Workshop on Computational Terminology, Dublin, Ireland, 2014-08-23.

Published Version

http://www.aclweb.org/anthology/W14-4807

Abstract

This paper introduces ACL RD-TEC: a dataset for evaluating the extraction and classification of terms from literature in the domain of computational linguistics. The dataset is derived from the Association for Computational Linguistics anthology reference corpus (ACL ARC). In its first release, the ACL RD-TEC consists of automatically segmented, part-of-speech-tagged ACL ARC documents, three lists of candidate terms, and more than 82,000 manually annotated terms. The annotated terms are marked as either valid or invalid, and valid terms are further classified as technology and non-technology terms. Technology terms signify methods, algorithms, and solutions in computational linguistics. The paper describes the dataset and reports the relevant statistics. We hope the step described in this paper encourages a collaborative effort towards building a full-fledged annotated corpus from the computational linguistics literature.
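
The two-level labelling described in the abstract (valid or invalid, with valid terms split into technology and non-technology) could be modelled roughly as follows. This is only a sketch: the names `TermLabel`, `AnnotatedTerm`, and `label_counts` are hypothetical, the example terms and their labels are made up for illustration and are not taken from the dataset, and nothing here reflects the actual file format of the ACL RD-TEC release.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class TermLabel(Enum):
    """Hypothetical labels mirroring the scheme in the abstract."""
    INVALID = "invalid"                # candidate rejected as a term
    TECHNOLOGY = "technology"          # valid: method, algorithm, or solution
    NON_TECHNOLOGY = "non-technology"  # valid, but not a technology term


@dataclass(frozen=True)
class AnnotatedTerm:
    """One manually annotated candidate term (illustrative structure only)."""
    text: str
    label: TermLabel

    @property
    def is_valid(self) -> bool:
        # Both technology and non-technology terms count as valid.
        return self.label is not TermLabel.INVALID


def label_counts(terms):
    """Count how many annotated terms carry each label."""
    return Counter(term.label for term in terms)


# Made-up examples, NOT drawn from the dataset.
examples = [
    AnnotatedTerm("hidden markov model", TermLabel.TECHNOLOGY),
    AnnotatedTerm("noun phrase", TermLabel.NON_TECHNOLOGY),
    AnnotatedTerm("of the", TermLabel.INVALID),
]

if __name__ == "__main__":
    for label, count in label_counts(examples).items():
        print(label.value, count)
```

Folding validity into a single three-valued label, rather than storing two separate flags, keeps the impossible combination "invalid but technology" from being representable.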

Description

Conference paper

URI

http://hdl.handle.net/10379/4489

Collections

Data Science Institute (Conference Papers)

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Ireland