TED-MWE: a bilingual parallel corpus with MWE annotation: Towards a methodology for annotating MWEs in parallel multilingual corpora

Monti, Johanna; Sangati, Federico; Arcan, Mihael

Date

2015-12-03

Recommended Citation

Monti, Johanna, Sangati, Federico, & Arcan, Mihael. (2015). TED-MWE: a bilingual parallel corpus with MWE annotation: Towards a methodology for annotating MWEs in parallel multilingual corpora. Paper presented at the Second Italian Conference on Computational Linguistics (CLiC-it 2015), Trento, Italy, 3-4 December.

Published Version

https://dx.doi.org/10.4000/books.aaccademia.1514

Abstract

Translating multiword expressions (MWEs) remains a major challenge for machine translation (MT): although MT has improved considerably in recent years, MWE mistranslations still occur very frequently. There is a need for large data sets, mainly parallel corpora, annotated with MWEs, since they are useful both for training statistical machine translation (SMT) systems and for evaluating MWE translation quality. This paper describes a methodology for annotating a parallel spoken corpus with MWEs. The dataset used for this experiment is an English-Italian corpus extracted from the TED spoken corpus and complemented by SMT output.

URI

http://hdl.handle.net/10379/14901

Collections

Data Science Institute (Conference Papers)

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Ireland