SciELO - Scientific Electronic Library Online

 



Polibits

On-line version ISSN 1870-9044

Abstract

EKBAL, Asif and BANDYOPADHYAY, Sivaji. Web-based Bengali News Corpus for Lexicon Development and POS Tagging. Polibits [online]. 2008, n.37, pp.21-30. ISSN 1870-9044.

Lexicon development and Part of Speech (POS) tagging are very important for almost all Natural Language Processing (NLP) applications. Rapidly developing these resources and tools with machine learning techniques for less computerized languages requires an appropriately tagged corpus. We have used a Bengali news corpus, developed from the web archive of a widely read Bengali newspaper. The corpus contains approximately 34 million wordforms. This corpus is used for lexicon development without employing extensive knowledge of the language. We have developed POS taggers using a Hidden Markov Model (HMM) and a Support Vector Machine (SVM). The lexicon contains around 128 thousand entries, and a manual check yields an accuracy of 79.6%. The POS taggers were initially developed for Bengali and achieved accuracies of 85.56% and 91.23% for HMM and SVM, respectively. Based on the Bengali news corpus, we identify various word-level orthographic features to use in the POS taggers. The lexicon and a Named Entity Recognition (NER) system, both developed using this corpus, are also used in POS tagging. The POS taggers are then evaluated with Hindi and Telugu data. The evaluation results show that SVM performs better than HMM for all three Indian languages.

Keywords: Web-based corpus; lexicon; part of speech (POS) tagging; hidden Markov model (HMM); support vector machine (SVM); Bengali; Hindi; Telugu.


 

All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.