TopicSearch—Personalized Web Clustering Engine Using Semantic Query Expansion, Memetic Algorithms and Intelligent Agents

Carlos Cobos¹, Martha Mendoza¹, Elizabeth León², Milos Manic³, and Enrique Herrera-Viedma⁴

¹ University of Cauca, Colombia (e-mail: ccobos@unicauca.edu.co, mmendoza@unicauca.edu.co).

² Universidad Nacional de Colombia, Colombia (e-mail: eleonguz@unal.edu.co).

³ University of Idaho at Idaho Falls, USA (e-mail: misko@uidaho.edu).

⁴ University of Granada, Spain (e-mail: viedma@decsai.ugr.es).

Manuscript received on March 13, 2013.
Accepted for publication on May 23, 2013.

Abstract

As more resources become available on the Web, finding the desired information becomes more difficult. Intelligent agents can assist users in this task, since they can search, filter, and organize information on behalf of their users. Web document clustering techniques can also help users find pages that meet their information requirements. This paper presents a personalized web document clustering model called TopicSearch. TopicSearch introduces a novel inverse document frequency function to improve the query expansion process, a new memetic algorithm for web document clustering, and a frequent phrases approach for defining cluster labels. Each user query is handled by an agent that coordinates several tasks, including query expansion, search results acquisition, preprocessing of search results, cluster construction and labeling, and visualization. These tasks are performed by specialized agents whose execution can be parallelized in certain instances. The model was successfully tested on fifty DMOZ datasets. The results demonstrated improved precision and recall over traditional algorithms (k-means, Bisecting k-means, STC, and Lingo). In addition, the presented model was evaluated by a group of twenty users, 90% of whom were in favor of the model.

Key words: Web document clustering, intelligent agents, query expansion, WordNet, memetic algorithms, user profile.
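The abstract reports precision and recall without defining them on this page. As a minimal sketch, assuming the standard set-based definitions (and not necessarily the exact evaluation protocol used in the DMOZ experiments), the two measures for a single cluster against a single reference category could be computed as follows. The function name `precision_recall` and the document identifiers are hypothetical and serve only as illustration.

```python
def precision_recall(cluster, relevant):
    """Set-based precision and recall of one cluster against one reference class.

    precision = |cluster ∩ relevant| / |cluster|
    recall    = |cluster ∩ relevant| / |relevant|
    """
    cluster, relevant = set(cluster), set(relevant)
    overlap = len(cluster & relevant)
    precision = overlap / len(cluster) if cluster else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
cluster = ["d1", "d2", "d3", "d4"]   # documents the algorithm grouped together
relevant = ["d2", "d3", "d5"]        # documents that truly belong to the category
p, r = precision_recall(cluster, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four clustered documents belong to the category, and two of the three category documents were recovered, so precision is 2/4 and recall is 2/3.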

ACKNOWLEDGMENT

The work in this paper was supported by a Research Grant from the University of Cauca under Project VRI-2560 and the National University of Colombia.
