Disentangling the Wikipedia Category Graph for Corpus Extraction

Ngonga Ngomo, Axel-Cyrille; Schumacher, Frank

Polibits

On-line version ISSN 1870-9044

Polibits n.39 México Jan./Jun. 2009

Axel-Cyrille Ngonga Ngomo and Frank Schumacher

Department of Computer Science, University of Leipzig, Johannisalle 23, Room 5-22, 04103 Leipzig, Germany; e-mail: ngonga@informatik.uni-leipzig.de

Manuscript received February 5, 2009.
Manuscript accepted for publication March 20, 2009.

Abstract

Several areas of research, such as knowledge management and natural language processing, require domain-specific corpora for tasks such as terminology extraction and ontology learning. The investigations presented here rest on the assumption that Wikipedia can be used for corpus extraction. Wikipedia has the advantage of possessing a semantic layer, which should ease the extraction of domain-specific corpora. Yet, because the Wikipedia category graph is scale-free, it cannot be used as is for this purpose. In this paper, we propose a novel approach to graph clustering called BorderFlow, which we apply to and evaluate on the Wikipedia category graph. We also present further possible applications of these results in the area of information retrieval.

Key words: Natural language processing, local graph clustering, corpus extraction.
