Detecting Derivatives using Specific and Invariant Descriptors

Poulard, Fabien; Hernandez, Nicolás; Daille, Béatrice

Polibits

On-line version ISSN 1870-9044

Polibits no. 43, México, Jan./Jun. 2011

Fabien Poulard, Nicolás Hernandez, and Béatrice Daille

University of Nantes / LINA (CNRS – UMR 6241), 2 rue de la Houssiniere, B.P. 92208, 44322 Nantes Cedex 3, France (e-mail: first.last@univ-nantes.fr).

Manuscript received November 9, 2010.
Manuscript accepted for publication January 15, 2011.

Abstract

This paper explores the detection of derivation links between texts (variously called plagiarism, near-duplication, revision, etc.) at the document level. We evaluate textual elements that implement the ideas of specificity and invariance, both separately and in combination, as descriptors for characterizing derivatives. To run this evaluation, we built a French press corpus based on Wikinews revisions. We obtain performance similar to that of the state-of-the-art method (n-gram overlap) while reducing the signature size and thus the processing cost. To ensure the verifiability and reproducibility of our results, we make our code and our corpus available to the community.

Key words: Textual derivatives, detection of derivations, near-duplicates, revisions, linguistic descriptors, French corpus.
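As an illustration of the n-gram overlap baseline mentioned in the abstract, the following minimal Python sketch computes the share of a candidate document's word n-grams that also occur in a source document. It is not the authors' implementation, and it does not implement the specific and invariant descriptors they evaluate. The function names (`word_ngrams`, `containment`), the regex tokenization, the default n = 3, and the choice of a containment-style score are all assumptions made for this example.

```python
import re


def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of a text (its 'signature' in this sketch)."""
    # Assumption: lowercase and split on runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(candidate: str, source: str, n: int = 3) -> float:
    """Fraction of the candidate's n-grams that also appear in the source."""
    cand = word_ngrams(candidate, n)
    src = word_ngrams(source, n)
    if not cand:
        # A candidate shorter than n words has no n-grams to compare.
        return 0.0
    return len(cand & src) / len(cand)


# Hypothetical example texts, not taken from the authors' corpus.
source = "the mayor of Nantes opened the new bridge on Monday"
revision = "on Monday the mayor of Nantes opened the new bridge downtown"
unrelated = "heavy rain is expected across the region this weekend"

print(containment(revision, source))
print(containment(unrelated, source))
```

The signature here is the full set of n-grams, so it grows with document length. The abstract's stated aim is to reach similar performance with a smaller signature, which this sketch does not attempt.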
