Computación y Sistemas

Print version ISSN 1405-5546

Comp. y Sist. vol.19 n.2 México Apr./Jun. 2015 



Segmentation Strategies to Face Morphology Challenges in Brazilian-Portuguese/English Statistical Machine Translation and Its Integration in Cross-Language Information Retrieval


Marta R. Costa-jussà


University of São Paulo, Institute of Mathematics and Statistics, Computer Science Department, Brazil.

Corresponding author is Marta R. Costa-jussà.


Article received on 03/09/2013.
Accepted on 08/05/2015.



Morphology is particularly useful in statistical machine translation because it can reduce data sparseness and compensate for a limited training corpus. In this work, we propose several approaches to introduce morphological knowledge into a standard phrase-based machine translation system. We segment words with two different tools, COGROO and MORFESSOR, which reduces vocabulary size and data sparseness. We then add to these segmentations the morphological information provided by a part-of-speech (POS) language model. We combine all these approaches using a Minimum Bayes Risk strategy. Experiments show that the enhanced system significantly improves over the baseline system on the Brazilian-Portuguese/English language pair. Finally, we report a case study on how enhancing the statistical machine translation system with morphology affects a cross-language application, ONAIR, which lets users search for information in video fragments through natural-language queries.

Keywords: Morphology, factored-based machine translation, cross-language information retrieval.
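
As an illustration of the Minimum Bayes Risk combination mentioned in the abstract, the following minimal sketch picks, among the outputs of several systems, the one with the highest expected similarity to all the others. It is not the implementation used in this work: the names `unigram_f1` and `mbr_select`, the uniform default weights, the example sentences, and the use of token-overlap F1 in place of a BLEU-based gain are all assumptions made for illustration.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Token-overlap F1 between two sentences (assumed stand-in for a BLEU-based gain)."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, weights=None):
    """Return the candidate that maximizes the expected gain against all candidates."""
    if weights is None:
        # Assumption: every system output is equally likely.
        weights = [1.0 / len(candidates)] * len(candidates)

    def expected_gain(hyp):
        return sum(w * unigram_f1(hyp, other)
                   for other, w in zip(candidates, weights))

    return max(candidates, key=expected_gain)


# Hypothetical outputs of three system variants for the same source sentence.
outputs = [
    "the user looks for information in the video",
    "the user looks for information in video",
    "user search information on the videos",
]
best = mbr_select(outputs)
print(best)
```

The selected sentence is the one that agrees most with the other candidates, so an output that differs strongly from the rest is unlikely to be chosen unless its weight is high.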





The author would especially like to thank Prof. Renata Wassermann for her help and assistance; Christian Paz-Trillo for his support on the ONAIR system; William Colen for his help with the COGROO system; Stella O. Tagnin for providing the out-of-domain corpus; and Fabiano Luz for his dedication in parallelizing this corpus.

This work has been partially supported by FAPESP through the ONAIR project (2010/19111-9) and the visiting researcher program (2012/02131-2), by the Spanish Ministry of Economy and Competitiveness through the Juan de la Cierva fellowship program and contract TEC2012-38939-C03-02, and by the Seventh Framework Program of the European Commission through the International Outgoing Fellowship Marie Curie Action (IMTraP-2011-29951).




All the contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.