SciELO - Scientific Electronic Library Online

 
Computación y Sistemas

Online version ISSN 2007-9737; print version ISSN 1405-5546

Abstract

LASKAR, Sahinur Rahman; MANNA, Riyanka; PAKRAY, Partha and BANDYOPADHYAY, Sivaji. A Domain Specific Parallel Corpus and Enhanced English-Assamese Neural Machine Translation. Comp. y Sist. [online]. 2022, vol.26, n.4, pp.1669-1687. Epub 17-Mar-2023. ISSN 2007-9737. https://doi.org/10.13053/cys-26-4-4423.

Machine translation is the automatic translation of text from one natural language to another. Neural machine translation is a widely adopted corpus-based approach, but it requires an adequate amount of training data, and domain-wise parallel corpora are needed to improve translation performance across domains. In this work, we prepare a domain-specific parallel corpus for the low-resource English-Assamese language pair, covering the Agriculture, Government Office, Judiciary, Social Media, Tourism, COVID-19, Sports, and Literature domains. We address data scarcity and word-order divergence through data augmentation and prior alignment. We also contribute an Assamese pretrained language model and Assamese word embeddings, both built from Assamese monolingual data, and a bilingual dictionary-based post-processing step to enhance transformer-based neural machine translation. We achieve state-of-the-art results in both the forward (English-to-Assamese) and backward (Assamese-to-English) translation directions.

Keywords: English-Assamese; low-resource; neural machine translation; parallel corpus; data augmentation; prior alignment; language model.
