Clustering XML Documents Using Structure and Content based on a New Similarity Function OverallSimSUX

Magdaleno, Damny; Fuentes, Ivett E.; García, María M.

doi:10.13053/CyS-19-1-1922

Computación y Sistemas

On-line version ISSN 2007-9737
Print version ISSN 1405-5546

Comp. y Sist. vol.19 n.1 Ciudad de México Jan./Mar. 2015

https://doi.org/10.13053/CyS-19-1-1922

Articles

Clustering XML Documents Using Structure and Content based on a New Similarity Function OverallSimSUX

Damny Magdaleno, Ivett E. Fuentes and María M. García

Computer Science Department, Universidad Central "Marta Abreu" de Las Villas (UCLV), Villa Clara, Cuba. dmg@uclv.edu.cu, ifuentes@uclv.edu.cu, mmgarcia@uclv.edu.cu.

Corresponding author is Damny Magdaleno.

Article received on 13/12/2013.
Accepted on 14/10/2014.

Abstract

Every day, more digital data in semi-structured formats become available on the World Wide Web, corporate intranets, and other media. Knowledge management through information search and processing is essential in the field of academic writing. This task is becoming increasingly complex and challenging, mainly because document collections are usually heterogeneous, large, diverse, and dynamic. To address these challenges, it is essential to manage more effectively the time needed to process scientific information. In this paper, we propose a new method for automatically clustering XML documents based on their content and structure, together with a new similarity function, OverallSimSUX, which captures the degree of similarity among documents. Experimental evaluation of our proposal on data sets showed better results than those of previous work.

Keywords: Clustering, XML, structure and content, similarity.
