SciELO - Scientific Electronic Library Online

 

Polibits

On-line version ISSN 1870-9044

Polibits no. 49, México, Jan./Jun. 2014

 

Comparison of Different Graph Distance Metrics for Semantic Text Based Classification

 

Nibaran Das1, Swarnendu Ghosh2, Teresa Gonçalves3, and Paulo Quaresma4

 

1 Computer Science and Engineering Department, Jadavpur University, Kolkata-700032, India (phone: +91 332 414 6766; fax: +91 332 414 6766; corresponding author e-mail: nibaran@ieee.org).

2 Computer Science and Engineering Department, Jadavpur University, Kolkata-700032, India.

3 Dept. of Computer Science, School of S & T, University of Évora, Évora, Portugal.

4 Dept. of Computer Science, School of S & T, University of Évora, Évora, Portugal, and with L2F - Spoken Language Systems Laboratory, INESC-ID, Lisbon, Portugal.

 

Manuscript received on January 4, 2014.
Accepted for publication on February 6, 2014.

 

Abstract

Semantic information is now widely used for text classification in place of bag-of-words approaches, which cannot represent certain kinds of documents appropriately. Semantic information can be represented through feature vectors or graphs, and graphs are normally the better choice because they are a more expressive data structure. However, very few methodologies exist in the literature for graph-based semantic representation. Error-tolerant graph matching techniques such as graph similarity measures can be used for text classification, but measures built on the Maximum Common Subgraph (mcs) and the Minimum Common Supergraph (MCS) are NP-hard to compute. In the present paper, summarized texts are used during the extraction of semantic information to make the computation faster. The semantic information of the texts is represented through discourse representation structures, which are later transformed into graphs. Five different graph distance measures based on mcs and MCS are used with a k-NN classifier to evaluate the text classification task. The text documents are taken from the Reuters-21578 text database and are distributed over 20 classes. Ten documents of each class are used for both training and testing purposes. The results show that the five measures have roughly equivalent potential for text classification and perform as well as traditional bag-of-words approaches.
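As a rough illustration of the mcs- and MCS-based distances used with k-NN in the abstract above, the following Python sketch may help. The abstract does not name the five measures, so the three formulas shown are well-known ones from the graph-matching literature and are chosen here as an assumption. The sketch also assumes that node labels are unique within each graph, in which case mcs and MCS reduce to set intersection and union and the NP-hard search is avoided. Graph size is taken as nodes plus edges, which is likewise an assumption, and all function names and class labels are hypothetical.

```python
from collections import Counter

# Assumption: a graph is (nodes, edges), where nodes is a frozenset of unique
# labels and edges is a frozenset of (source, target) label pairs.

def make_graph(edges):
    """Build a graph from an iterable of (source, target) label pairs."""
    edges = frozenset(edges)
    nodes = frozenset(label for edge in edges for label in edge)
    return nodes, edges

def size(graph):
    """Graph size as node count plus edge count (an assumption)."""
    nodes, edges = graph
    return len(nodes) + len(edges)

def mcs(g1, g2):
    """Maximum common subgraph; valid only under the unique-label assumption."""
    return g1[0] & g2[0], g1[1] & g2[1]

def min_supergraph(g1, g2):
    """Minimum common supergraph under the same assumption."""
    return g1[0] | g2[0], g1[1] | g2[1]

def d_mcs(g1, g2):
    """1 - |mcs| / max(|G1|, |G2|), the distance of Bunke and Shearer."""
    largest = max(size(g1), size(g2))
    return 0.0 if largest == 0 else 1 - size(mcs(g1, g2)) / largest

def d_union(g1, g2):
    """1 - |mcs| / (|G1| + |G2| - |mcs|), the union-normalised variant."""
    common = size(mcs(g1, g2))
    total = size(g1) + size(g2) - common
    return 0.0 if total == 0 else 1 - common / total

def d_super(g1, g2):
    """|MCS| - |mcs|, an unnormalised distance using both structures."""
    return size(min_supergraph(g1, g2)) - size(mcs(g1, g2))

def knn_classify(query, training, distance, k=3):
    """Majority vote among the k training graphs closest to the query."""
    nearest = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy graphs standing in for graphs derived from discourse representation
# structures; the labels "class_a" and "class_b" are made up.
g_a = make_graph([("company", "buy"), ("buy", "share")])
g_b = make_graph([("company", "sell"), ("sell", "share")])
query = make_graph([("company", "buy"), ("buy", "stock")])
training = [(g_a, "class_a"), (g_b, "class_b")]
predicted = knn_classify(query, training, d_mcs, k=1)
```

Because the distance function is passed to `knn_classify` as a parameter, swapping one measure for another requires no change to the classifier, which mirrors the kind of side-by-side comparison the abstract describes. For graphs with repeated node labels, the set operations would have to be replaced by a true (and much more expensive) common-subgraph search.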

Key words: Graph distance metrics, maximum common subgraph, minimum common supergraph, semantic information, text classification.

 


 


Creative Commons License: All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.