String Distances for Near-duplicate Detection

Dănăilă, Iulia; Dinu, Liviu P.; Niculae, Vlad; Sulea, Octavia-Maria

Polibits

On-line version ISSN 1870-9044

Abstract

DĂNĂILĂ, Iulia; DINU, Liviu P.; NICULAE, Vlad and SULEA, Octavia-Maria. String Distances for Near-duplicate Detection. Polibits [online]. 2012, n.45, pp.21-25. ISSN 1870-9044.

Near-duplicate detection is important when dealing with large, noisy databases in data mining tasks. In this paper, we present the results of applying the Rank distance and the Smith-Waterman distance, alongside more popular string similarity measures such as the Levenshtein distance, in combination with a disjoint-set data structure, to the problem of near-duplicate detection.
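To illustrate how a string distance and a disjoint-set structure can work together, the following is a minimal sketch and not the authors' implementation. It uses the Levenshtein distance only; the Rank distance or a Smith-Waterman score could be substituted for the `levenshtein` function. The names `DisjointSet` and `near_duplicate_groups`, the all-pairs comparison, and the threshold `max_distance=2` are assumptions made for this example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        previous = current
    return previous[-1]


class DisjointSet:
    """Union-find over the indices 0..n-1 (hypothetical helper)."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        root_x, root_y = self.find(x), self.find(y)
        if root_x != root_y:
            self.parent[root_y] = root_x


def near_duplicate_groups(records, max_distance=2):
    """Group records whose pairwise distance is at most max_distance.

    The threshold is an assumed value; merging is transitive, so two
    records can share a group through a chain of close neighbours.
    """
    sets = DisjointSet(len(records))
    for i, j in combinations(range(len(records)), 2):
        if levenshtein(records[i], records[j]) <= max_distance:
            sets.union(i, j)
    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(sets.find(i), []).append(record)
    return list(groups.values())
```

For example, `near_duplicate_groups(["John Smith", "Jon Smith", "John Smyth", "Mary Jones"])` places the three spelling variants in one group and leaves "Mary Jones" on its own. Comparing all pairs is quadratic in the number of records, so a large database would need some form of candidate filtering before this step.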

Keywords: Near-duplicate detection; string similarity measures; database; data mining.
