Polibits
On-line version ISSN 1870-9044
Polibits n.39 México Jan./Jun. 2009
Articles
Disentangling the Wikipedia Category Graph for Corpus Extraction
Axel-Cyrille Ngonga Ngomo and Frank Schumacher
Department of Computer Science, University of Leipzig, Johannisalle 23, Room 522, 04103 Leipzig, Germany; email: ngonga@informatik.uni-leipzig.de
Manuscript received February 5, 2009.
Manuscript accepted for publication March 20, 2009.
Abstract
In several areas of research, such as knowledge management and natural language processing, domain-specific corpora are required for tasks such as terminology extraction and ontology learning. The investigations presented here are based on the assumption that Wikipedia can be used for corpus extraction. Wikipedia has the advantage of possessing a semantic layer, which should ease the extraction of domain-specific corpora. Yet, because the Wikipedia category graph is scale-free, it cannot be used as is for this purpose. In this paper, we propose a novel approach to graph clustering called BorderFlow, which we apply to and evaluate on the Wikipedia category graph. We also present possible further applications of these results in the area of information retrieval.
Key words: Natural language processing, local graph clustering, corpus extraction.
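The abstract names local graph clustering but does not spell out the BorderFlow procedure, so the following is only a minimal illustrative sketch of the general idea, not the authors' algorithm. It assumes an undirected weighted graph and grows a cluster from a seed node by greedily adding the neighbour that most improves the ratio of edge weight pointing from border nodes into the cluster to edge weight leaving the cluster. The function names (`build_graph`, `flow_ratio`, `grow_cluster`), the stopping rule, and the toy graph are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    return graph


def flow_ratio(graph, cluster):
    """Weight from border nodes into the cluster divided by weight leaving it.

    A border node is a cluster member with at least one neighbour outside
    the cluster. Inner nodes contribute nothing to either sum.
    """
    inward = outward = 0.0
    for node in cluster:
        if all(nbr in cluster for nbr in graph[node]):
            continue  # inner node: every neighbour is already in the cluster
        for nbr, w in graph[node].items():
            if nbr in cluster:
                inward += w
            else:
                outward += w
    return float("inf") if outward == 0 else inward / outward


def grow_cluster(graph, seed):
    """Greedily expand a cluster around `seed` while the flow ratio improves."""
    cluster = {seed}
    while True:
        candidates = {n for node in cluster for n in graph[node]} - cluster
        if not candidates:
            return cluster
        # sorted() makes tie-breaking deterministic for comparable labels
        best = max(sorted(candidates),
                   key=lambda c: flow_ratio(graph, cluster | {c}))
        if flow_ratio(graph, cluster | {best}) <= flow_ratio(graph, cluster):
            return cluster
        cluster.add(best)


# Toy graph (hypothetical): two triangles joined by a single bridge C-D.
toy_edges = [
    ("A", "B", 1), ("B", "C", 1), ("A", "C", 1),
    ("D", "E", 1), ("E", "F", 1), ("D", "F", 1),
    ("C", "D", 1),
]
toy_graph = build_graph(toy_edges)
print(sorted(grow_cluster(toy_graph, "A")))  # ['A', 'B', 'C']
```

Because the expansion only inspects the neighbourhood of the current cluster, a procedure of this kind never needs to traverse the whole graph, which is the property that makes local clustering attractive for a large graph with highly connected hub nodes.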