Reducing the Number of Canonical Form Tests for Frequent Subgraph Mining

Gago Alonso, Andrés; Carrasco Ochoa, Jesús A.; Medina Pagóla, José E.; Martínez Trinidad, José F

Computación y Sistemas

On-line version ISSN 2007-9737; print version ISSN 1405-5546

Comp. y Sist. vol. 15 no. 2, Ciudad de México, Oct./Dec. 2011

Articles

Reducing the Number of Canonical Form Tests for Frequent Subgraph Mining

Reduciendo el número de pruebas de forma canónica para la minería de subgrafos frecuentes

Andrés Gago Alonso¹, Jesús A. Carrasco Ochoa², José E. Medina Pagóla¹, and José F. Martínez Trinidad²

¹ Data Mining Department, Advanced Technologies Application Center, La Habana, Cuba. E-mail: agago@cenatav.co.cu, jmedina@cenatav.co.cu

² Computer Science Department, National Institute of Astrophysics, Optics and Electronics, Santa María de Tonantzintla, Puebla, México. E-mail: ariel@inaoep.mx, fmartine@inaoep.mx

Article received on 12/03/2010.
Accepted 05/04/2011.

Abstract

Frequent connected subgraph (FCS) mining is an interesting problem with wide real-world applications. Most FCS mining algorithms detect duplicate candidates by means of canonical form tests. These tests have high computational complexity and therefore reduce the efficiency of graph miners. In this paper, we introduce novel properties that reduce the number of canonical form tests in FCS mining. Based on these properties, we present a new FCS mining algorithm called gRed. Experiments on real-world datasets show the impact of the proposed properties on the efficiency of gRed, which performs fewer canonical form tests than gSpan. In addition, the performance of gRed is compared against gSpan and other state-of-the-art algorithms.

Keywords: Data mining, frequent patterns, graph mining, frequent subgraph.

Resumen

La minería de subgrafos conexos frecuentes es un problema interesante con amplias aplicaciones en la vida práctica. La mayor parte de los algoritmos para este tipo de minería detectan los candidatos duplicados utilizando pruebas de forma canónica. Este tipo de pruebas tienen una alta complejidad computacional, lo cual afecta el desempeño de los algoritmos de minería de grafos. En este artículo se proponen nuevas propiedades para reducir el número de pruebas de forma canónica en este tipo de minería. Basado en estas propiedades, se propone un nuevo algoritmo llamado gRed. Los resultados experimentales en colecciones de datos reales muestran el impacto de las nuevas propiedades en la eficiencia de gRed, reduciendo el número de pruebas de forma canónicas con respecto a gSpan. Además, el desempeño de gRed es comparado respecto gSpan y otros algoritmos reportados en el estado del arte.

Palabras clave: Minería de datos, patrones frecuentes, minería de grafos, subgrafos frecuentes.
