Show simple item record

dc.contributor.author: Goswami, Koustava
dc.contributor.author: Sarkar, Rajdeep
dc.contributor.author: Chakravarthi, Bharathi Raja
dc.contributor.author: Fransen, Theodorus
dc.contributor.author: McCrae, John P.
dc.date.accessioned: 2022-09-27T08:19:13Z
dc.date.available: 2022-09-27T08:19:13Z
dc.date.issued: 2020-12
dc.identifier.citation: Koustava Goswami, Rajdeep Sarkar, Bharathi Raja Chakravarthi, Theodorus Fransen, and John P. McCrae. 2020. Unsupervised Deep Language and Dialect Identification for Short Texts. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1606–1617, Barcelona, Spain (Online). International Committee on Computational Linguistics. [en_IE]
dc.identifier.uri: http://hdl.handle.net/10379/17393
dc.description.abstract: Automatic Language Identification (LI) or Dialect Identification (DI) of short texts of closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, in the case of very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model understands the sentence constructions of languages by applying attention to character relations, which helps to optimize the clustering of languages. We have performed our experiments on three short-text datasets for different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets have shown significant improvement over state-of-the-art unsupervised methods, and our model has outperformed state-of-the-art LI and DI systems in supervised settings. [en_IE] (An illustrative sketch of the joint embedding-and-clustering idea appears after the record fields below.)
dc.description.sponsorship: This publication has emanated from research in part supported by the Irish Research Council under grant number IRCLA/2017/129 (CARDAMOM-Comparative Deep Models of Language for Minority and Historical Languages). It is co-funded by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 P2 (Insight 2) and Irish Research Council under project ID GOIPG/2019/3480. We would like to thank Ms. Omnia Zayed and Ms. Priya Rani for their valuable comments and suggestions towards improving our paper. We would also like to thank the anonymous reviewers for their insights on this work. [en_IE]
dc.format: application/pdf [en_IE]
dc.language.iso: en [en_IE]
dc.publisher: International Committee on Computational Linguistics [en_IE]
dc.relation.ispartof: Proceedings of the 28th International Conference on Computational Linguistics [en]
dc.rights: Attribution 4.0 International (CC BY 4.0)
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Deep Language [en_IE]
dc.subject: Dialect Identification [en_IE]
dc.subject: Short Texts [en_IE]
dc.subject: Automatic Language Identification (LI) [en_IE]
dc.subject: Dialect Identification (DI) [en_IE]
dc.subject: natural language processing [en_IE]
dc.title: Unsupervised deep language and dialect identification for short texts [en_IE]
dc.type: Conference Paper [en_IE]
dc.date.updated: 2022-09-25T16:17:04Z
dc.identifier.doi: 10.18653/v1/2020.coling-main.141
dc.local.publishedsource: https://doi.org/10.18653/v1/2020.coling-main.141 [en_IE]
dc.description.peer-reviewed: non-peer-reviewed
dc.contributor.funder: Irish Research Council [en_IE]
dc.contributor.funder: Science Foundation Ireland [en_IE]
dc.internal.rssid: 29174551
dc.local.contact: Theodorus Fransen. Email: theodorus.fransen@nuigalway.ie
dc.local.copyrightchecked: Yes
dc.local.version: PUBLISHED
dcterms.project: info:eu-repo/grantAgreement/SFI/SFI Research Centres/12/RC/2289/IE/INSIGHT - Irelands Big Data and Analytics Research Centre/ [en_IE]
nui.item.downloads: 31
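
Note on the method: the abstract describes jointly learning character-attentive sentence embeddings and cluster assignments so that short texts group by language or dialect without labels. Below is a minimal, hypothetical PyTorch sketch of that general idea, pairing a character self-attention encoder with a deep-embedded-clustering style soft assignment and a KL sharpening step. The class names, architecture, and training step are illustrative assumptions only, not the authors' UDLDI implementation; see the published source above for the actual method.

# Hypothetical sketch only: jointly learning character-aware sentence embeddings
# and soft cluster (language) assignments for short texts. Not the UDLDI code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CharAttentionEncoder(nn.Module):
    """Embed characters, apply self-attention over character positions,
    and mean-pool into a fixed-size, L2-normalised sentence embedding."""

    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        x = self.char_emb(char_ids)                   # (batch, chars, dim)
        pad_mask = char_ids.eq(0)                     # True where padding
        h, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        h = h.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        pooled = h.sum(1) / (~pad_mask).sum(1, keepdim=True).clamp(min=1)
        return F.normalize(self.proj(pooled), dim=-1)


class UnsupervisedLanguageClusterer(nn.Module):
    """Soft cluster assignments from embedding-to-centroid distance
    (Student's-t style assignment, as in deep embedded clustering)."""

    def __init__(self, encoder: CharAttentionEncoder, n_clusters: int, dim: int = 64):
        super().__init__()
        self.encoder = encoder
        self.centroids = nn.Parameter(torch.randn(n_clusters, dim))

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        z = self.encoder(char_ids)                                # (batch, dim)
        dist = torch.cdist(z, F.normalize(self.centroids, dim=-1))
        q = (1.0 + dist.pow(2)).reciprocal()
        return q / q.sum(dim=1, keepdim=True)                     # soft assignments


if __name__ == "__main__":
    # Toy batch: 4 "sentences" of 12 characters from a 100-symbol vocabulary.
    char_ids = torch.randint(1, 100, (4, 12))
    model = UnsupervisedLanguageClusterer(CharAttentionEncoder(vocab_size=100), n_clusters=3)
    assignments = model(char_ids)
    # One illustrative training step: sharpen assignments with a KL objective
    # against a self-training target, refining encoder and centroids jointly.
    target = (assignments ** 2) / assignments.sum(0)
    target = (target.t() / target.sum(1)).t().detach()
    loss = F.kl_div(assignments.log(), target, reduction="batchmean")
    loss.backward()
    print(assignments.shape, float(loss))

In a full recipe the centroids would typically be initialised from the encoder outputs (e.g. with k-means) and the KL target recomputed periodically; those standard deep-clustering choices are likewise assumptions here, not details taken from the paper.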


