
Polibits

On-line version ISSN 1870-9044

Polibits no. 43, México, Jan./Jun. 2011

 

Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

 

Marco Turchi and Maud Ehrmann

 

Joint Research Centre (JRC), IPSC – GlobSec, European Commission, Via Fermi 2749, 21027, Ispra (VA), Italy (e-mail: name.surname@jrc.ec.europa.eu).

 

Manuscript received November 2, 2010.
Manuscript accepted for publication January 14, 2011.

 

Abstract

The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on its parallel training data: phrases that are absent from the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each new association corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add other types of information to the model.

Key words: Machine translation, knowledge, morphological resources.
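
The expansion idea summarized in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the lexicon entries, the positional tag strings (loosely modelled on MULTEXT-style morphosyntactic descriptions), the `tag_similarity` scoring rule, and the `threshold` value are all assumptions made for illustration. The sketch also covers only single-word phrases where both sides vary, and it does not touch the reordering model.

```python
# Toy illustration only: lexicon entries, tag strings, and the scoring rule
# are hypothetical and are not taken from the paper.

# form -> (lemma, positional morphosyntactic tag), e.g. "Ncfp" is read here
# as noun, common, feminine, plural (an assumed MULTEXT-like convention).
EN_LEXICON = {
    "house": ("house", "Nc-s"),
    "houses": ("house", "Nc-p"),
}
FR_LEXICON = {
    "maison": ("maison", "Ncfs"),
    "maisons": ("maison", "Ncfp"),
}


def tag_similarity(tag_a, tag_b):
    """Fraction of tag positions on which the two tags agree."""
    length = max(len(tag_a), len(tag_b))
    if length == 0:
        return 0.0
    matches = sum(1 for a, b in zip(tag_a, tag_b) if a == b)
    return matches / length


def variants(word, lexicon):
    """Other lexicon forms that share the lemma of `word`."""
    if word not in lexicon:
        return []
    lemma = lexicon[word][0]
    return [form for form, (lem, _) in lexicon.items()
            if lem == lemma and form != word]


def expand(phrase_table, src_lex, tgt_lex, threshold=0.5):
    """Propose new associations by varying both sides of existing ones."""
    new_pairs = {}
    for (src, tgt) in phrase_table:
        for s_var in variants(src, src_lex):
            for t_var in variants(tgt, tgt_lex):
                score = tag_similarity(src_lex[s_var][1], tgt_lex[t_var][1])
                # Keep only unseen pairs whose tags are similar enough.
                if score >= threshold and (s_var, t_var) not in phrase_table:
                    new_pairs[(s_var, t_var)] = score
    return new_pairs


table = {("house", "maison"): 0.8}
print(expand(table, EN_LEXICON, FR_LEXICON))
```

In this toy setting the only proposed association pairs the plural forms, because their tags agree on three of four positions. A real system would additionally need to assign translation and reordering scores to each new entry, which this sketch leaves out.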

 


 
