Online version ISSN 1870-9044
Polibits no. 41, Mexico, Jan./Jun. 2010
Special section: processing of semantic information
Semi-Automatic Parallel Corpora Extraction from Comparable News Corpora
Thoudam Doren Singh and Sivaji Bandyopadhyay
Manuscript received February 15, 2010.
Manuscript accepted for publication May 31, 2010.
A parallel corpus is a necessary resource in many multilingual and cross-lingual natural language processing applications, including Machine Translation and Cross-Lingual Information Retrieval. Preparing a large-scale parallel corpus takes time and demands linguistic skill. In the present work, a technique has been developed that extracts a parallel corpus between Manipuri, a morphologically rich and resource-constrained Indian language, and English from comparable news corpora collected from the web. A medium-sized Manipuri-English bilingual lexicon and a list of Manipuri-English transliterated entities have been developed and used in the present work. Using morphological information for the agglutinative and inflective Manipuri language further improves the alignment quality based on the similarity measure. A high level of performance is desirable, since errors in sentence alignment cause further errors in systems that use the aligned text. The system has been evaluated, and an error analysis has been carried out. The technique is effective for the Manipuri-English language pair and is extendable to other resource-constrained, agglutinative and inflective Indian languages.
Key words: Parallel corpora, similarity measure, bilingual lexicon, morphology, named entity list.