SciELO - Scientific Electronic Library Online




On-line version ISSN 1870-9044

Polibits no. 43, México, Jan./Jun. 2011


Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources


Marco Turchi and Maud Ehrmann


Joint Research Centre (JRC), IPSC – GlobSec, European Commission, Via Fermi 2749, 21027, Ispra (VA), Italy (e–mail:


Manuscript received November 2, 2010.
Manuscript accepted for publication January 14, 2011.



The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on its parallel training data: phrases that do not occur in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system using external morphological resources, without adding more parallel data. A set of new phrase associations is added to the translation and reordering models; each new association corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.

Key words: Machine translation, knowledge, morphological resources.
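
The expansion idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicons, the feature names (`pos`, `num`, `gen`), the helper functions, and the agreement-based similarity are all assumptions made for illustration, and the sketch handles only single-word phrases without assigning translation or reordering scores.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy lexicons: surface form -> (lemma, morphosyntactic features).
Features = Dict[str, str]
Lexicon = Dict[str, Tuple[str, Features]]

EN: Lexicon = {
    "house": ("house", {"pos": "N", "num": "sg"}),
    "houses": ("house", {"pos": "N", "num": "pl"}),
}
FR: Lexicon = {
    "maison": ("maison", {"pos": "N", "num": "sg", "gen": "f"}),
    "maisons": ("maison", {"pos": "N", "num": "pl", "gen": "f"}),
}


def variants(word: str, lexicon: Lexicon) -> Set[str]:
    """Return every surface form that shares the lemma of `word`."""
    if word not in lexicon:
        return {word}
    lemma = lexicon[word][0]
    return {form for form, (lem, _) in lexicon.items() if lem == lemma}


def tag_similarity(a: Features, b: Features) -> float:
    """Fraction of features present on both sides that have equal values."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)


def expand(
    table: Set[Tuple[str, str]],
    src_lex: Lexicon,
    tgt_lex: Lexicon,
    threshold: float = 1.0,
) -> Set[Tuple[str, str]]:
    """Propose new (source, target) pairs from morphological variants
    of existing pairs whose features agree at least `threshold`."""
    new_pairs: Set[Tuple[str, str]] = set()
    for src, tgt in table:
        for s in variants(src, src_lex):
            for t in variants(tgt, tgt_lex):
                if (s, t) in table:
                    continue  # already known to the system
                if s not in src_lex or t not in tgt_lex:
                    continue  # no morphological information available
                if tag_similarity(src_lex[s][1], tgt_lex[t][1]) >= threshold:
                    new_pairs.add((s, t))
    return new_pairs


if __name__ == "__main__":
    print(expand({("house", "maison")}, EN, FR))
```

With the single known pair `("house", "maison")`, the sketch proposes only `("houses", "maisons")`, because the mixed singular/plural combinations fall below the agreement threshold. A real system would also need to score each new association before adding it to the translation and reordering models.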






All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.