Leveraging orthographic information to improve machine translation of under-resourced languages
Asoka Chakravarthi, Bharathi Raja
This thesis describes how we improve word sense translation for under-resourced languages by using orthographic information, with a particular focus on creating resources through machine translation. The first goal of the thesis is to clean noisy corpora by removing code-mixed content at the word level, based on orthographic information, in order to improve machine translation quality. Our results indicate that removing code-mixed text on the basis of orthography improves translation quality for Dravidian languages. We then turn to the use of training data from closely related languages. Although languages within the same language family share many properties, many under-resourced languages are written in their own native scripts, which makes it difficult to exploit these similarities. We propose to alleviate the problem of differing scripts by transcribing each native script into a common representation such as the Latin script or the International Phonetic Alphabet (IPA). We also show that our method could aid the creation or improvement of wordnets for under-resourced languages using machine translation. Further, we investigate bilingual lexicon induction using pre-trained monolingual word embeddings and orthographic information. We use existing resources such as IndoWordNet entries as a seed dictionary and test set for the under-resourced Dravidian languages. To take advantage of orthographic information, we propose to bring the related languages into a single script before creating word embeddings, and to use the longest common subsequence to capture cognate information. Our methods for under-resourced word sense translation of Dravidian languages outperformed state-of-the-art systems in both automatic and manual evaluation.
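As a rough illustration of two ideas mentioned in the abstract, the following Python sketch shows one possible way to drop word-level code-mixed tokens by script and to score cognate candidates with the longest common subsequence. It is not taken from the thesis: the function names, the use of Unicode character names to detect a script, and the normalisation of the LCS score by the longer word are all assumptions made for this example.

```python
import unicodedata


def is_foreign(word: str, script_prefix: str) -> bool:
    """Return True if any letter in `word` is outside the given script.

    Assumption: a script is recognised by the prefix of the Unicode
    character name (e.g. "TAMIL", "LATIN"). This is a simple heuristic
    chosen for illustration, not the thesis's actual procedure.
    """
    return any(
        ch.isalpha() and not unicodedata.name(ch, "").startswith(script_prefix)
        for ch in word
    )


def drop_code_mixed(tokens: list[str], script_prefix: str) -> list[str]:
    """Keep tokens whose letters all belong to the expected script.

    Tokens without letters (digits, punctuation) are kept unchanged.
    """
    return [tok for tok in tokens if not is_foreign(tok, script_prefix)]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    """LCS length divided by the longer word's length (assumed normalisation)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```

Under these assumptions, `lcs_ratio` would be applied to two words only after both have been brought into a common script, for example Latin transliterations; `lcs_ratio("amma", "amme")` gives 0.75 because the two strings share the subsequence "amm".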