
Polibits

On-line version ISSN 1870-9044

Polibits no. 45, Mexico, Jun. 2012

 

String Distances for Near-duplicate Detection

 

Iulia Dănăilă1, Liviu P. Dinu1, Vlad Niculae1, and Octavia-Maria Sulea2

 

1 The authors are with the Faculty of Mathematics and Computer Science, University of Bucharest, Romania (e-mail: *danailaiulia@yahoo.com, **ldinu@fmi.unibuc.ro, ***vlad@vene.ro).

2 Octavia-Maria Sulea is also with the Faculty of Foreign Languages and Literatures, University of Bucharest, Romania (e-mail: mary.octavia@gmail.com).

 

Manuscript received on November 15, 2011; accepted for publication on January 6, 2012.

 

Abstract

Near-duplicate detection is important when dealing with large, noisy databases in data mining tasks. In this paper, we present the results of applying the Rank distance and the Smith-Waterman distance, along with more popular string similarity measures such as the Levenshtein distance, to the problem of near-duplicate detection, using a disjoint set data structure to group the records found to be similar.

Key words: Near-duplicate detection, string similarity measures, database, data mining.
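As an illustration of the general idea in the abstract, the following is a minimal sketch, not the authors' implementation, of how a string distance can be combined with a disjoint set structure to group near-duplicates. It uses only the Levenshtein distance; the function names, the length-normalized threshold of 0.2, and the quadratic all-pairs comparison are assumptions chosen for the example and do not come from the paper.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


class DisjointSet:
    """Union-find with path halving and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]


def group_near_duplicates(records, threshold=0.2):
    """Merge records whose normalized distance is at most `threshold`.

    The threshold and the normalization by the longer string's length
    are hypothetical choices made for this sketch.
    """
    sets = DisjointSet(len(records))
    for i, j in combinations(range(len(records)), 2):
        longest = max(len(records[i]), len(records[j])) or 1
        if levenshtein(records[i], records[j]) / longest <= threshold:
            sets.union(i, j)
    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(sets.find(i), []).append(record)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["Smith, J. 1981", "Smith, J 1981", "Waterman, M. 1981"]
    for group in group_near_duplicates(sample):
        print(group)
```

Because the disjoint set merges transitively, two records can end up in the same group without being directly within the threshold of each other, as long as a chain of similar records connects them. Any other distance, such as the Rank distance or a Smith-Waterman score, could presumably be substituted for `levenshtein` in this sketch.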

 


 

ACKNOWLEDGMENTS

All authors contributed equally to the work presented in this paper. The research of Liviu P. Dinu was supported by the CNCS, IDEI - PCE project 311/2011, "The Structure and Interpretation of the Romanian Nominal Phrase in Discourse Representation Theory: the Determiners."

 


Creative Commons License: All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.