Comparison of Different Graph Distance Metrics for Semantic Text Based Classification

Das, Nibaran; Ghosh, Swarnendu; Gonçalves, Teresa; Quaresma, Paulo

Polibits

On-line version ISSN 1870-9044

Polibits no. 49, México, Jan./Jun. 2014

Nibaran Das¹, Swarnendu Ghosh², Teresa Gonçalves³, and Paulo Quaresma⁴

¹ Computer Science and Engineering Department, Jadavpur University, Kolkata-700032, India (phone: +91 332 414 6766; fax: +91 332 414 6766; corresponding author e-mail: nibaran@ieee.org).

² Computer Science and Engineering Department, Jadavpur University, Kolkata-700032, India.

³ Dept. of Computer Science, School of S & T, University of Évora, Évora, Portugal.

⁴ Dept. of Computer Science, School of S & T, University of Évora, Évora, Portugal, and with L2F - Spoken Language Systems Laboratory, INESC-ID, Lisbon, Portugal.

Manuscript received on January 4, 2014.
Accepted for publication on February 6, 2014.

Abstract

Semantic information is now widely used for text classification in place of bag-of-words approaches, because bag-of-words representations cannot adequately represent certain kinds of documents. Semantic information can be represented through feature vectors or graphs; of the two, a graph is normally the better choice because it is a more expressive data structure. However, very few methodologies exist in the literature for representing semantics as graphs. Error-tolerant graph matching techniques such as graph similarity measures can be utilised for text classification, but measures based on the Maximum Common Subgraph (mcs) and the Minimum Common Supergraph (MCS) require solving NP-hard problems. In the present paper, summarized texts are used during the extraction of semantic information to make the computation faster. The semantic information of the texts is represented through discourse representation structures, which are then transformed into graphs. Five different graph distance measures based on mcs and MCS are used with a k-NN classifier to evaluate the text classification task. The text documents are taken from the Reuters-21578 text database and are distributed over 20 classes. Ten documents of each class are used for both training and testing purposes. The results show that the five measures perform roughly equivalently on text classification and are as good as traditional bag-of-words approaches.

Key words: Graph distance metrics, maximum common subgraph, minimum common supergraph, semantic information, text classification.
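
To make the idea of an mcs- or MCS-based distance concrete, the following is a minimal sketch and not the authors' implementation. It assumes that every node carries a unique label, so that the maximum common subgraph and minimum common supergraph reduce to set intersection and union (without that assumption the problem is NP-hard, as the abstract notes). The two distances shown, 1 - |mcs| / max(|G1|, |G2|) and |MCS| - |mcs|, are commonly cited in the graph-matching literature; the abstract does not say which five measures the paper evaluates. The graph encoding, the size definition (nodes plus edges), all function names, and the toy data are illustrative assumptions.

```python
from collections import Counter

# A graph is a pair (nodes, edges): a set of node labels and a set of
# directed (source, target) label pairs.
# ASSUMPTION: node labels are unique, so mcs/MCS become set operations.

def graph(edges):
    """Build a graph from an iterable of (source, target) label pairs."""
    edges = set(edges)
    nodes = {label for edge in edges for label in edge}
    return nodes, edges

def max_common_subgraph(g1, g2):
    """mcs under the unique-label assumption: shared nodes and edges."""
    return g1[0] & g2[0], g1[1] & g2[1]

def min_common_supergraph(g1, g2):
    """MCS under the unique-label assumption: union of nodes and edges."""
    return g1[0] | g2[0], g1[1] | g2[1]

def size(g):
    """Graph size, taken here (an assumption) as nodes plus edges."""
    return len(g[0]) + len(g[1])

def d_mcs(g1, g2):
    """Normalised distance: 1 - |mcs| / max(|g1|, |g2|)."""
    denom = max(size(g1), size(g2))
    if denom == 0:
        return 0.0
    return 1.0 - size(max_common_subgraph(g1, g2)) / denom

def d_mmcs(g1, g2):
    """Unnormalised distance: |MCS| - |mcs|."""
    return size(min_common_supergraph(g1, g2)) - size(max_common_subgraph(g1, g2))

def knn_predict(query, training, distance, k=1):
    """Majority vote over the k training graphs closest to the query."""
    ranked = sorted(training, key=lambda item: distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical, not drawn from Reuters-21578).
g_a = graph([("bank", "raise"), ("raise", "rate")])
g_b = graph([("bank", "raise"), ("raise", "loan")])
g_c = graph([("team", "win"), ("win", "match")])

training = [(g_a, "finance"), (g_c, "sport")]
prediction = knn_predict(g_b, training, d_mcs, k=1)
```

Here g_a and g_b share two nodes and one edge, so d_mcs(g_a, g_b) is 1 - 3/5 = 0.4 and d_mmcs(g_a, g_b) is 7 - 3 = 4, while g_b shares nothing with g_c. The 1-NN prediction for g_b is therefore "finance". Passing d_mmcs instead of d_mcs to knn_predict swaps the distance without changing the classifier.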

