SciELO - Scientific Electronic Library Online



Computación y Sistemas

Print version ISSN 1405-5546

Comp. y Sist. vol.19 n.1 México Jan./Mar. 2015 



Clustering XML Documents Using Structure and Content based on a New Similarity Function OverallSimSUX


Damny Magdaleno, Ivett E. Fuentes and María M. García


Computer Science Department, Universidad Central "Marta Abreu" de Las Villas (UCLV), Villa Clara, Cuba.

Corresponding author is Damny Magdaleno.


Article received on 13/12/2013.
Accepted on 14/10/2014.



Every day, more digital data in semi-structured format become available on the World Wide Web, corporate intranets, and other media. Knowledge management through information search and processing is essential in the field of academic writing. This task is increasingly complex and challenging, mainly because document collections are usually heterogeneous, large, diverse, and dynamic. To meet these challenges, it is essential to better manage the time needed to process scientific information. In this paper, we propose a new method for the automatic clustering of XML documents based on their content and structure, together with a new similarity function, OverallSimSUX, which captures the degree of similarity among documents. Experiments with data sets showed better results than those reported in previous work.

Keywords: Clustering, XML, structure and content, similarity.
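
To make the idea of comparing XML documents by both structure and content concrete, the following is a minimal sketch only. It is not the OverallSimSUX function, whose definition is not given in this abstract. The sketch assumes a structural score computed as the Jaccard overlap of root-to-element tag paths, a content score computed as the cosine similarity of raw word counts, and a hypothetical weight `alpha` that mixes the two. All function names are illustrative.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-element tag paths (a crude structural profile)."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def term_counts(xml_text):
    """Return lowercase word counts over all text nodes of the document."""
    root = ET.fromstring(xml_text)
    words = " ".join(root.itertext()).lower().split()
    return Counter(words)


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(c1, c2):
    """Cosine similarity of two word-count vectors; 0.0 if either is empty."""
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def combined_similarity(xml_a, xml_b, alpha=0.5):
    """Illustrative mix of structure and content scores (NOT OverallSimSUX).

    alpha is a hypothetical weight: 1.0 uses structure only, 0.0 content only.
    """
    structure = jaccard(tag_paths(xml_a), tag_paths(xml_b))
    content = cosine(term_counts(xml_a), term_counts(xml_b))
    return alpha * structure + (1 - alpha) * content
```

A pairwise matrix of such scores could then be passed to any clustering algorithm that accepts similarities. The choice of tag paths, word counts, and a linear mix is an assumption made here for illustration, not a description of the authors' method.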






All the contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.