ISSN 1870-9044


Special section: natural language processing

Web–based Bengali News Corpus for Lexicon Development and POS Tagging

Asif Ekbal and Sivaji Bandyopadhyay

Department of Computer Science and Engineering, Jadavpur University, Kolkata, India 700032, e–mail: asif.ekbal@gmail.com, sivaji_cse_ju@yahoo.com.

Manuscript received May 4, 2008. Manuscript accepted for publication June 12, 2008.

Abstract

Lexicon development and Part of Speech (POS) tagging are important for almost all Natural Language Processing (NLP) applications. Rapid development of these resources and tools with machine learning techniques for less computerized languages requires an appropriately tagged corpus. We use a Bengali news corpus developed from the web archive of a widely read Bengali newspaper. The corpus contains approximately 34 million wordforms and is used for lexicon development without employing extensive knowledge of the language. We have developed POS taggers using a Hidden Markov Model (HMM) and a Support Vector Machine (SVM). The lexicon contains around 128 thousand entries, and a manual check yields an accuracy of 79.6%. The POS taggers were first developed for Bengali, where they achieve accuracies of 85.56% (HMM) and 91.23% (SVM). Based on the Bengali news corpus, we identify various word–level orthographic features to use in the POS taggers. The lexicon and a Named Entity Recognition (NER) system, both developed using this corpus, are also used in POS tagging. The POS taggers are then evaluated on Hindi and Telugu data. The evaluation results show that SVM performs better than HMM for all three Indian languages.

Key words: Web–based corpus, lexicon, part of speech (POS) tagging, hidden Markov model (HMM), support vector machine (SVM), Bengali, Hindi, Telugu.
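
As an illustration of the HMM approach named in the abstract, the following is a minimal sketch of a bigram HMM tagger decoded with the Viterbi algorithm. It is not the authors' implementation. The toy English corpus, the tag names, the add-one smoothing, and the function names `train_hmm`, `log_prob`, and `viterbi` are all assumptions made for illustration. The taggers described in the abstract additionally use orthographic features, a lexicon, and NER output, none of which are modelled here.

```python
import math
from collections import defaultdict


def train_hmm(tagged_sentences):
    """Count tag-bigram transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start symbol
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(counts, context, event, n_outcomes):
    """Add-one smoothed log P(event | context); smoothing is an assumption."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for `words` under the model."""
    if not words:
        return []
    trans, emit, tags, vocab = model
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words

    # best[i][t] = (log-score of the best path ending in tag t at i, previous tag)
    best = [{}]
    for t in tags:
        score = (log_prob(trans, "<s>", t, n_tags)
                 + log_prob(emit, t, words[0], n_words))
        best[0][t] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p][0] + log_prob(trans, p, t, n_tags))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            score += log_prob(emit, t, words[i], n_words)
            best[i][t] = (score, prev_tag)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


# Toy training data (hypothetical; not drawn from the Bengali corpus).
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

model = train_hmm(TOY_CORPUS)
print(viterbi(["a", "cat", "barks"], model))
```

Working in log space avoids numerical underflow on long sentences. A real tagger would replace the add-one smoothing with a better estimate for unknown words, which is where resources such as a lexicon or orthographic features typically enter.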

