String Distances for Near-duplicate Detection

Dănăilă, Iulia; Dinu, Liviu P.; Niculae, Vlad; Sulea, Octavia-Maria

Polibits

Online version ISSN 1870-9044

Polibits no. 45, Mexico, Jun. 2012

Iulia Dănăilă¹, Liviu P. Dinu¹, Vlad Niculae¹, and Octavia-Maria Sulea²

¹ The authors are with the Faculty of Mathematics and Computer Science, University of Bucharest, Romania (e-mail: danailaiulia@yahoo.com, ldinu@fmi.unibuc.ro, vlad@vene.ro).

² Octavia-Maria Sulea is also with the Faculty of Foreign Languages and Literatures, University of Bucharest, Romania (e-mail: mary.octavia@gmail.com).

Manuscript received on November 15, 2011; accepted for publication on January 6, 2012.

Abstract

Near-duplicate detection is important when dealing with large, noisy databases in data mining tasks. In this paper, we present the results of applying the Rank distance and the Smith-Waterman distance, along with more popular string similarity measures such as the Levenshtein distance, to the problem of near-duplicate detection, using a disjoint set data structure to group the detected near-duplicates.

Key words: Near-duplicate detection, string similarity measures, database, data mining.
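
As a rough illustration of how a string distance and a disjoint set structure can be combined for this task, the following Python sketch links any two records whose normalized Levenshtein distance falls below a threshold and then reads the clusters off the disjoint set. It is a minimal sketch under stated assumptions, not the procedure evaluated in the paper: the names `levenshtein`, `DisjointSet`, and `cluster_near_duplicates`, the normalization by the longer string's length, the threshold of 0.2, and the all-pairs comparison are all hypothetical choices made for the example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class DisjointSet:
    """Union-find over the indices 0..n-1."""

    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def cluster_near_duplicates(records, threshold=0.2):
    """Group records whose normalized distance is at most `threshold` (assumed value)."""
    ds = DisjointSet(len(records))
    # Naive all-pairs comparison; quadratic in the number of records.
    for i, j in combinations(range(len(records)), 2):
        longest = max(len(records[i]), len(records[j])) or 1
        if levenshtein(records[i], records[j]) / longest <= threshold:
            ds.union(i, j)
    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(ds.find(i), []).append(record)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["near-duplicate detection", "near duplicate detection",
              "rank distance", "rank distanse", "smith-waterman"]
    for group in cluster_near_duplicates(sample):
        print(group)
```

Because the disjoint set merges clusters transitively, two records can end up in the same group without being directly within the threshold of each other. Any other distance, such as the Rank distance or a Smith-Waterman score, could be substituted for `levenshtein` in this sketch.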

ACKNOWLEDGMENTS

All authors contributed equally to the work presented in this paper. The research of Liviu P. Dinu was supported by the CNCS, IDEI - PCE project 311/2011, "The Structure and Interpretation of the Romanian Nominal Phrase in Discourse Representation Theory: the Determiners."
