TED-MWE: a bilingual parallel corpus with MWE annotation: Towards a methodology for annotating MWEs in parallel multilingual corpora
Date
2015-12-03
Author
Monti, Johanna
Sangati, Federico
Arcan, Mihael
Recommended Citation
Monti, Johanna, Sangati, Federico, & Arcan, Mihael. (2015). TED-MWE: a bilingual parallel corpus with MWE annotation: Towards a methodology for annotating MWEs in parallel multilingual corpora. Paper presented at the Second Italian Conference on Computational Linguistics (CLiC-it 2015), Trento, Italy, 3-4 December.
Abstract
Translating multiword expressions (MWEs) remains a major challenge for Machine Translation (MT): although MT has improved considerably in recent years, MWE mistranslations still occur very frequently. Large data sets annotated with MWEs, mainly parallel corpora, are needed because they are useful both for training statistical MT (SMT) systems and for evaluating MWE translation quality. This paper describes a methodology for annotating a parallel spoken corpus with MWEs. The dataset used for this experiment is an English-Italian corpus extracted from the TED spoken corpus and complemented by SMT output.