SemR-11: a multi-lingual gold-standard for semantic similarity and relatedness for eleven languages
Date
2018-05-07
Author
Barzegar, Siamak
Davis, Brian
Zarrouk, Manel
Handschuh, Siegfried
Freitas, André
Recommended Citation
Barzegar, Siamak, Davis, Brian, Zarrouk, Manel, Handschuh, Siegfried, & Freitas, André. (2018). SemR-11: a multi-lingual gold-standard for semantic similarity and relatedness for eleven languages. Paper presented at the 11th edition of the Language Resources and Evaluation Conference (LREC2018), Miyazaki, Japan, 7-12 May.
Abstract
This work describes SemR-11, a multi-lingual dataset for evaluating semantic similarity and relatedness for 11 languages (German,
French, Russian, Italian, Dutch, Chinese, Portuguese, Swedish, Spanish, Arabic and Persian). Semantic similarity and relatedness gold
standards were initially used to support the evaluation of semantic distance measures in the context of linguistic and knowledge
resources and distributional semantic models. SemR-11 builds upon the English gold standards of Miller & Charles (MC), Rubenstein &
Goodenough (RG), WordSimilarity 353 (WS-353), and SimLex-999, providing a canonical translation for them. The final dataset consists
of 15,917 word pairs and can be used to support the construction and evaluation of semantic similarity/relatedness and distributional
semantic models. As a case study, the SemR-11 test collections were used to investigate how different distributional semantic models
built from corpora in different languages and with different sizes perform on semantic similarity and relatedness
tasks.
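As an illustration of how a gold standard of this kind is typically used, the following minimal sketch scores a distributional model by the Spearman rank correlation between its cosine similarities and the human ratings. It is not the authors' evaluation code. The function names (`cosine`, `average_ranks`, `spearman`, `evaluate`), the toy German word pairs, their scores, and the two-dimensional vectors are all hypothetical and do not come from SemR-11; the dataset's actual file format is not described here, so loading is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs, vectors):
    """Correlate model similarities with gold scores.

    Pairs containing a word missing from `vectors` are skipped.
    Returns (spearman_rho, number_of_pairs_scored).
    """
    gold, predicted = [], []
    for word1, word2, score in pairs:
        if word1 in vectors and word2 in vectors:
            gold.append(score)
            predicted.append(cosine(vectors[word1], vectors[word2]))
    return spearman(gold, predicted), len(gold)


# Hypothetical toy data: NOT taken from SemR-11.
toy_pairs = [
    ("auto", "wagen", 3.9),
    ("vogel", "hahn", 3.0),
    ("auto", "hahn", 1.0),
    ("auto", "vogel", 0.5),
    ("auto", "zauberer", 0.1),  # "zauberer" has no vector, so it is skipped
]
toy_vectors = {
    "auto": [1.0, 0.0],
    "wagen": [0.9, 0.1],
    "vogel": [0.0, 1.0],
    "hahn": [0.3, 0.9],
}

rho, n_scored = evaluate(toy_pairs, toy_vectors)
print(f"Spearman rho = {rho:.3f} over {n_scored} pairs")
```

Rank correlation is used in this sketch because gold-standard ratings and cosine values live on different scales, so only their ordering is compared. Reporting the number of pairs actually scored alongside the correlation makes vocabulary coverage visible when comparing models built from corpora of different sizes.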