SciELO - Scientific Electronic Library Online
Polibits

On-line version ISSN 1870-9044

Polibits, no. 38, México, Jul./Dec. 2008

 

Special section: natural language processing

 

Named Entity Recognition in Hindi using Maximum Entropy and Transliteration

 

Sujan Kumar Saha¹, Partha Sarathi Ghosh², Sudeshna Sarkar³, and Pabitra Mitra⁴

 

1 Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (email: sujan.kr.saha@gmail.com).

2 HCL Technologies, Bangalore, India (email: partha.silicon@gmail.com).

3 Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (email: shudeshna@gmail.com).

4 Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (email: pabitra@gmail.com).

 

Manuscript received July 10, 2008.
Manuscript accepted for publication October 22, 2008.

 

Abstract

Named entities are perhaps the most important indexing elements in text for most information extraction and mining tasks. Building a Named Entity Recognition (NER) system is challenging when adequate resources are not available. Gazetteer lists are often used in developing NER systems, but for many resource-poor languages gazetteer lists of sufficient size do not exist, although relevant lists are sometimes available in English. Proper transliteration makes these English lists usable in NER tasks for such languages. In this paper, we describe a Maximum Entropy based NER system for Hindi and explore different features applicable to the Hindi NER task. To improve performance, we incorporate several gazetteer lists, which were collected from the web and are in English. To make these English lists useful for the Hindi NER task, we propose a two-phase transliteration methodology. Using the transliteration-based gazetteer lists yields a considerable performance improvement. The proposed gazetteer preparation methodology is also applicable to other languages: besides Hindi, we applied the transliteration approach to a Bengali NER task and again obtained a performance improvement.

Key words: Gazetteer list preparation, named entity recognition, natural language processing, transliteration.
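
The abstract does not spell out the two phases of the transliteration step. As a minimal sketch of one plausible reading (an assumption made for illustration, not the authors' method), English gazetteer entries and Hindi tokens can each be mapped into a shared intermediate form and matched there, giving a binary gazetteer feature that a classifier could use alongside other token features. The consonant table, the function names, and the vowel-dropping rule below are all hypothetical and deliberately incomplete.

```python
# Toy consonant table (hypothetical, deliberately incomplete):
# Devanagari character -> intermediate symbol.
DEVANAGARI_TO_KEY = {
    "क": "k", "ट": "t", "त": "t", "न": "n",
    "प": "p", "र": "r", "ल": "l",
}

ENGLISH_VOWELS = set("aeiou")


def english_to_key(name: str) -> str:
    """Assumed phase 1: map an English gazetteer entry to the intermediate form."""
    # Lowercase, keep letters only, and drop vowels.
    return "".join(
        ch for ch in name.lower() if ch.isalpha() and ch not in ENGLISH_VOWELS
    )


def hindi_to_key(word: str) -> str:
    """Assumed phase 2: map a Hindi token to the same intermediate form."""
    # Characters missing from the toy table (vowels, vowel signs, virama,
    # and any unlisted consonant) are simply dropped.
    return "".join(DEVANAGARI_TO_KEY.get(ch, "") for ch in word)


def build_gazetteer_keys(english_names):
    """Convert an English list once, so lookups happen in the intermediate space."""
    return {english_to_key(name) for name in english_names}


def gazetteer_feature(token: str, keys) -> bool:
    """Binary feature: does the Hindi token match any entry of the English list?"""
    key = hindi_to_key(token)
    return bool(key) and key in keys


location_keys = build_gazetteer_keys(["Kolkata", "Patna", "Kanpur"])
print(gazetteer_feature("कोलकाता", location_keys))  # True
print(gazetteer_feature("कल", location_keys))        # False
```

Matching in an intermediate space avoids having to generate a single correct Hindi spelling for every English name, at the cost of occasional false matches when distinct words collapse to the same key. A realistic table would have to cover the full script and treat vowels less crudely.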

 


 


All contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.