Pre-Processing of English-Hindi Corpus for Statistical Machine Translation

Arora, Karunesh Kumar; Agrawal, Shyam S.

doi:10.13053/cys-21-4-2697

Services on Demand

Journal

Article

Indicators

Cited by SciELO
Access statistics

Computación y Sistemas

On-line version ISSN 2007-9737Print version ISSN 1405-5546

Abstract

ARORA, Karunesh Kumar and AGRAWAL, Shyam S.. Pre-Processing of English-Hindi Corpus for Statistical Machine Translation. Comp. y Sist. [online]. 2017, vol.21, n.4, pp.725-737. ISSN 2007-9737. https://doi.org/10.13053/cys-21-4-2697.

Corpus may be considered as fuel for the data driven approaches of machine translation. Parallel corpus building is a labour intensive task, which makes it a costly and scarce resource. Full potential of available data needs to be exploited and this can be ensured by removing different types of inconsistencies as being faced throughout the NLP domain. The paper presented here describes the experiments carried out on corpus text pre-processing for building the baseline Statistical Machine Translation (SMT) system. Text pre-processing performed here is classified in two stages – i. the first one relates to handling of orthographic representation of content and ii. the second stage relates to handling of non-lexical words. The first stage covers punctuation symbols, casing, word spellings and their normalization while second stage covers handling of numbers and named entities (NEs) applied on the best settings observed in first stage. The motivation behind performing these experiments was to derive a relationship and gauge the extent of pre-processing the corpus, thereby building a considerably optimized baseline SMT system. This baseline system would provide platform for performing further experiments with different syntactic and semantic factors in future. The findings presented here is for English-Hindi language pair, however, the concept of pre-processing is language neutral and can be transcended to any other language pair. The best performance is reported with retaining the punctuation symbols, lower-cased English corpus and spell normalized Hindi corpus for English to Hindi translation. Further to these, in the second stage of experiments, handling numbers and Named Entities have been described wherein these are mapped to unique class labels. The impact of these experiments have been explained with their appropriateness for the concerned language pair.

Keywords : Statistical machine translation; preprocessing; normalization; named entity handling.

· text in English · English (

pdf )