SciELO - Scientific Electronic Library Online




On-line version ISSN 1870-9044

Polibits no. 41, México, Jan./Jun. 2010


Special section: processing of semantic information


Semi–Automatic Parallel Corpora Extraction from Comparable News Corpora


Thoudam Doren Singh and Sivaji Bandyopadhyay


Computer Science and Engineering Department, Jadavpur University, Kolkata, India.


Manuscript received February 15, 2010.
Manuscript accepted for publication May 31, 2010.



The parallel corpus is a necessary resource in many multilingual and cross-lingual natural language processing applications, including Machine Translation and Cross-Lingual Information Retrieval. Preparing a large-scale parallel corpus takes time and demands linguistic skill. In the present work, a technique has been developed that extracts a parallel corpus between Manipuri, a morphologically rich and resource-constrained Indian language, and English from comparable news corpora collected from the web. A medium-sized Manipuri–English bilingual lexicon and a list of Manipuri–English transliterated entities have been developed and used in the present work. Using morphological information for the agglutinative and inflective Manipuri language further improves the alignment quality based on the similarity measure. A high level of performance is desirable, since errors in sentence alignment cause further errors in systems that use the aligned text. The system has been evaluated and an error analysis has been carried out. The technique is effective for the Manipuri–English language pair and is extendable to other resource-constrained, agglutinative and inflective Indian languages.

Key words: Parallel corpora, similarity measure, bilingual lexicon, morphology, named entity list.
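
The abstract does not give the similarity measure or the alignment procedure, so the following is only a minimal sketch of the general idea, under stated assumptions. It scores a source/target sentence pair with a Dice-style overlap between source tokens that can be translated (through a bilingual lexicon or a transliterated entity list) and the target tokens, and it optionally strips suffixes before lookup to mimic the use of morphological information. The Dice formula, the greedy best-match alignment, the threshold, and all names (`LEXICON`, `ENTITIES`, `SUFFIXES`, `stem`, `similarity`, `align`) are assumptions for illustration, and the toy entries are invented placeholders, not real Manipuri data.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy resources (invented placeholders, not real Manipuri words).
LEXICON: Dict[str, Set[str]] = {
    "sw_minister": {"minister"},
    "sw_school": {"school", "schools"},
    "sw_open": {"open", "opened"},
}
# Stand-in for a list of transliterated named entities.
ENTITIES: Dict[str, Set[str]] = {"imphal": {"imphal"}}
# Stand-ins for suffixes that a morphological analyzer would separate.
SUFFIXES: Tuple[str, ...] = ("_loc", "_pl")


def stem(token: str) -> str:
    """Repeatedly strip known suffixes (a crude proxy for morphological analysis)."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix):
                token = token[: -len(suffix)]
                changed = True
    return token


def translations(token: str, use_morphology: bool) -> Set[str]:
    """Look a source token up in the lexicon and the entity list."""
    key = stem(token) if use_morphology else token
    return LEXICON.get(key, set()) | ENTITIES.get(key, set())


def similarity(src: List[str], tgt: List[str], use_morphology: bool = True) -> float:
    """Dice-style score: 2 * matched source tokens / (|src| + |tgt|)."""
    if not src or not tgt:
        return 0.0
    tgt_set = {t.lower() for t in tgt}
    matches = sum(
        1 for s in src if translations(s.lower(), use_morphology) & tgt_set
    )
    return 2.0 * matches / (len(src) + len(tgt))


def align(
    src_sents: List[List[str]],
    tgt_sents: List[List[str]],
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    """Greedily pair each source sentence with its best target above a threshold."""
    pairs: List[Tuple[int, int, float]] = []
    for i, src in enumerate(src_sents):
        scored = [(similarity(src, tgt), j) for j, tgt in enumerate(tgt_sents)]
        if not scored:
            continue
        score, j = max(scored)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

In this toy setting, a source sentence whose tokens carry suffixes matches fewer lexicon entries when `use_morphology=False`, which is one plausible way morphological information could raise the similarity score of a true sentence pair.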






Creative Commons License: All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.