<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1870-9044</journal-id>
<journal-title><![CDATA[Polibits]]></journal-title>
<abbrev-journal-title><![CDATA[Polibits]]></abbrev-journal-title>
<issn>1870-9044</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Innovación y Desarrollo Tecnológico en Cómputo]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1870-90442011000100001</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Detecting Derivatives using Specific and Invariant Descriptors]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Poulard]]></surname>
<given-names><![CDATA[Fabien]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Hernandez]]></surname>
<given-names><![CDATA[Nicolás]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Daille]]></surname>
<given-names><![CDATA[Béatrice]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[University of Nantes, Laboratoire Informatique de Nantes Atlantique]]></institution>
<addr-line><![CDATA[Nantes ]]></addr-line>
<country>France</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>06</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>06</month>
<year>2011</year>
</pub-date>
<numero>43</numero>
<fpage>7</fpage>
<lpage>13</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1870-90442011000100001&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1870-90442011000100001&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1870-90442011000100001&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[This paper explores the detection of derivation links between texts (variously called plagiarism, near-duplication, revision, etc.) at the document level. We evaluate textual elements that implement the ideas of specificity and invariance, as well as their combination, to characterize derivatives. To run this evaluation, we built a French press corpus based on Wikinews revisions. We obtain performance similar to that of the state-of-the-art method (n-gram overlap) while reducing the signature size and thus the processing costs. To ensure the verifiability and reproducibility of our results, we make our code and our corpus available to the community.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Textual derivatives]]></kwd>
<kwd lng="en"><![CDATA[detection of derivations]]></kwd>
<kwd lng="en"><![CDATA[near-duplicates]]></kwd>
<kwd lng="en"><![CDATA[revisions]]></kwd>
<kwd lng="en"><![CDATA[linguistic descriptors]]></kwd>
<kwd lng="en"><![CDATA[French corpus]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[  	    <p align="center"><font face="verdana" size="4"><b>Detecting Derivatives using Specific and Invariant Descriptors</b></font></p> 	    <p align="center"><font face="verdana" size="2">&nbsp;</font></p> 	    <p align="center"><font face="verdana" size="2"><b>Fabien Poulard, Nicol&aacute;s Hernandez, and B&eacute;atrice Daille</b></font></p> 	    <p align="justify"><font face="verdana" size="2">&nbsp;</font></p> 	    <p align="justify"><font face="verdana" size="2"><i>University of Nantes / LINA (CNRS &ndash; UMR 6241), 2 rue de la Houssini&egrave;re, B.P. 92208, 44322 Nantes Cedex 3, France (e-mail:</i> <a href="mailto:first.last@univ-nantes.fr">first.last@univ-nantes.fr</a>).</font></p> 	    <p align="justify"><font face="verdana" size="2">&nbsp;</font></p> 	    <p align="justify"><font face="verdana" size="2">Manuscript received November 9, 2010.    <br>     Manuscript accepted for publication January 15, 2011.</font></p> 	    <p align="justify"><font face="verdana" size="2">&nbsp;</font></p> 	    ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2"><b>Abstract</b></font></p> 	    <p align="justify"><font face="verdana" size="2">This paper explores the detection of derivation links between texts (variously called plagiarism, near-duplication, revision, etc.) at the document level. We evaluate textual elements that implement the ideas of specificity and invariance, as well as their combination, to characterize derivatives. To run this evaluation, we built a French press corpus based on Wikinews revisions. We obtain performance similar to that of the state-of-the-art method (n-gram overlap) while reducing the signature size and thus the processing costs. To ensure the verifiability and reproducibility of our results, we make our code and our corpus available to the community.</font></p> 	    <p align="justify"><font face="verdana" size="2"><b>Key words</b>: Textual derivatives, detection of derivations, near-duplicates, revisions, linguistic descriptors, French corpus.</font></p> 	    <p align="justify"><font face="verdana" size="2">&nbsp;</font></p> 	    <p align="justify"><font face="verdana" size="2"><a href="/pdf/poli/n43/n43a1.pdf" target="_blank">DOWNLOAD ARTICLE IN PDF FORMAT</a></font></p> 	    <p align="justify"><font face="verdana" size="2">&nbsp;</font></p> 	    <p align="justify"><font face="verdana" size="2"><b>REFERENCES</b></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;1&#93; S. M. Z. Eissen and B. Stein, "Intrinsic plagiarism detection," in <i>Proceedings of the 28th European Conference on IR Research (ECIR 2006),</i> 2006, pp. 565&ndash;569. &#91;Online&#93;. Available: <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.110.5366" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.110.5366</a>.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045012&pid=S1870-9044201100010000100001&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;2&#93; A. Aizawa, "Analysis of source identified text corpora: exploring the statistics of the reused text and authorship," in <i>Proceedings of the 41st Annual Meeting on Association for Computational Linguistics,</i> vol. 1, 2003, pp. 383&ndash;390. &#91;Online&#93;. Available: <a href="http://portal.acm.org/citation.cfm?id=1075145" target="_blank">http://portal.acm.org/citation.cfm?id=1075145</a>.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045014&pid=S1870-9044201100010000100002&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;3&#93; N. Shivakumar and H. Garcia-Molina, "Building a scalable and accurate copy detection mechanism," in <i>Proceedings of the 1st ACM International Conference on Digital Libraries (DL 1996),</i> 1996, pp. 160&ndash;168. &#91;Online&#93;. Available: <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.51.6064" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.51.6064</a>.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045016&pid=S1870-9044201100010000100003&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;4&#93; P. 
Clough, "Measuring text reuse," Ph.D. dissertation, University of Sheffield, Mar. 2003.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045018&pid=S1870-9044201100010000100004&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;5&#93; C. Lyon, R. Barrett, and J. Malcolm, "Plagiarism is easy, but also easy to detect," <i>Plagiary,</i> vol. 1, pp. 1&ndash;10, 2006.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045020&pid=S1870-9044201100010000100005&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;6&#93; A. Z. Broder, "On the resemblance and containment of documents," in <i>Compression and Complexity of SEQUENCES 1997,</i> 1997, pp. 21&ndash;29. &#91;Online&#93;. Available: <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.779" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.779</a>.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045022&pid=S1870-9044201100010000100006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;7&#93; N. 
Heintze, "Scalable document fingerprinting (Extended abstract)," 1996. &#91;Online&#93;. Available: <a href="http://www.cs.cmu.edu/afs/cs/user/nch/www/koala/main.html" target="_blank">http://www.cs.cmu.edu/afs/cs/user/nch/www/koala/main.html</a>.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045024&pid=S1870-9044201100010000100007&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;8&#93; U. Manber, "Finding similar files in a large file system," in <i>Proceedings of the USENIX Winter 1994 Technical Conference,</i> October 1994, pp. 1&ndash;10. &#91;Online&#93;. Available: <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.3222&rep=rep1&type=pdf" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.3222&amp;rep=rep1&amp;type=pdf</a>.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045026&pid=S1870-9044201100010000100008&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;9&#93; S. Brin, J. Davis, and H. Garcia-Molina, "Copy detection mechanisms for digital documents," in <i>Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD 1995),</i> 1995, pp. 398&ndash;409. &#91;Online&#93;. Available: <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.8485" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.8485</a>.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045028&pid=S1870-9044201100010000100009&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;10&#93; M. Henzinger, "Finding near-duplicate web pages," in <i>Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval &ndash; SIGIR '06,</i> E. N. Efthimiadis, S. T. Dumais, D. Hawking, and J. e. Kalervo, Eds. ACM, 2006, p. 284. &#91;Online&#93;. Available: <a href="http://portal.acm.org/citation.cfm?doid=1148170.1148222" target="_blank">http://portal.acm.org/citation.cfm?doid=1148170.1148222</a>.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045030&pid=S1870-9044201100010000100010&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;11&#93; Y. Bernstein, M. Shokouhi, and J. Zobel, "Compact features for detection of near-duplicates in distributed retrieval," in <i>Proceedings of the Symposium on String Processing and Information Retrieval,</i> 2006, pp. 110&ndash;121. &#91;Online&#93;. Available: <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.88.3243" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.88.3243</a>.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045032&pid=S1870-9044201100010000100011&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;12&#93; R. Gaizauskas, J. Foster, Y. Wilks, J. Arundel, P. Clough, and S. S. L. Piao, "The METER corpus: a corpus for analysing journalistic text reuse," in <i>Proceedings of the 2001 Corpus Linguistics Conference,</i> 2001, pp. 214&ndash;223. &#91;Online&#93;. Available: <a href="http://nlp.shef.ac.uk/meter/" target="_blank">http://nlp.shef.ac.uk/meter/</a>.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045034&pid=S1870-9044201100010000100012&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;13&#93; H. Yang, "Next steps in near-duplicate detection for eRulemaking," in <i>Proceedings of the 7th Annual International Conference on Digital Government Research (DG.O 2006),</i> 2006, pp. 239&ndash;248. &#91;Online&#93;. Available: <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.3732" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.3732</a>.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045036&pid=S1870-9044201100010000100013&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;14&#93; M. Potthast, B. Stein, and P. Rosso, "An evaluation framework for plagiarism detection," in <i>Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010,</i> 2010.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045038&pid=S1870-9044201100010000100014&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;15&#93; K. W. Church and W. A. Gale, "Inverse document frequency (IDF): A measure of deviations from Poisson," in <i>Proceedings of the Third Workshop on Very Large Corpora,</i> 1995, pp. 121&ndash;130.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045040&pid=S1870-9044201100010000100015&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;16&#93; N. Fourour, E. Morin, and B. Daille, "Incremental recognition and referential categorization of French proper names," in <i>Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002),</i> vol. 3, 2002, pp. 1068&ndash;1074.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045042&pid=S1870-9044201100010000100016&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;17&#93; F. Cerbah, "Exogeneous and endogeneous approaches to semantic categorization of unknown technical terms," in <i>Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), </i>2000, pp. 145&ndash;151.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045044&pid=S1870-9044201100010000100017&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;18&#93; B. Daille, "Conceptual structuring through term variations," in <i>Proceedings ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment,</i> 2003, pp. 9&ndash;16.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045046&pid=S1870-9044201100010000100018&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;19&#93; A. Abeille, L. Clement, and F. Toussenel, <i>Building a tree bank for French. </i>Kluwer Academic Publishers, 2003, pp. 165&ndash;187.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045048&pid=S1870-9044201100010000100019&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;20&#93; T. C. Hoad and J. Zobel, "Methods for identifying versioned and plagiarised documents," <i>Journal of the American Society for Information Science and Technology,</i> vol. 54, pp. 203&ndash;215, 2002. &#91;Online&#93;. Available: <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.2680" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.2680</a>.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045050&pid=S1870-9044201100010000100020&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p> 	    <!-- ref --><p align="justify"><font face="verdana" size="2">&#91;21&#93; D. Metzler, Y. Bernstein, B. W. Croft, A. Moffat, and J. Zobel, "Similarity measures for tracking information flow," in <i>CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management.</i> New York, NY, USA: ACM, 2005, pp. 517&ndash;524.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=6045052&pid=S1870-9044201100010000100021&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>      ]]></body><back>
<ref-list>
<ref id="B1">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Eissen]]></surname>
<given-names><![CDATA[S. M. Z.]]></given-names>
</name>
<name>
<surname><![CDATA[Stein]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Intrinsic plagiarism detection]]></article-title>
<source><![CDATA[Proceedings of the 28th European Conference on IR Research (ECIR 2006)]]></source>
<year>2006</year>
<page-range>565-569</page-range></nlm-citation>
</ref>
<ref id="B2">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Aizawa]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Analysis of source identified text corpora: exploring the statistics of the reused text and authorship]]></article-title>
<source><![CDATA[Proceedings of the 41st Annual Meeting on Association for Computational Linguistics]]></source>
<year>2003</year>
<volume>1</volume>
<page-range>383-390</page-range></nlm-citation>
</ref>
<ref id="B3">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Shivakumar]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Garcia-Molina]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Building a scalable and accurate copy detection mechanism]]></article-title>
<source><![CDATA[Proceedings of the 1st ACM International Conference on Digital Libraries (DL 1996)]]></source>
<year>1996</year>
<page-range>160-168</page-range></nlm-citation>
</ref>
<ref id="B4">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Clough]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Measuring text reuse]]></article-title>
<source><![CDATA[Ph.D. dissertation]]></source>
<year>2003</year>
<month>03</month>
<publisher-name><![CDATA[University of Sheffield]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B5">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Lyon]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Barrett]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Malcolm]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Plagiarism is easy, but also easy to detect]]></article-title>
<source><![CDATA[Plagiary]]></source>
<year>2006</year>
<volume>1</volume>
<page-range>1-10</page-range></nlm-citation>
</ref>
<ref id="B6">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Broder]]></surname>
<given-names><![CDATA[A. Z.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[On the resemblance and containment of documents]]></article-title>
<source><![CDATA[Compression and Complexity of SEQUENCES 1997]]></source>
<year>1997</year>
<page-range>21-29</page-range></nlm-citation>
</ref>
<ref id="B7">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Heintze]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Scalable document fingerprinting (Extended abstract)]]></article-title>
<source><![CDATA[]]></source>
<year>1996</year>
</nlm-citation>
</ref>
<ref id="B8">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Manber]]></surname>
<given-names><![CDATA[U.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Finding similar files in a large file system]]></article-title>
<source><![CDATA[Proceedings of the USENIX Winter 1994 Technical Conference]]></source>
<year>1994</year>
<month>10</month>
<page-range>1-10</page-range></nlm-citation>
</ref>
<ref id="B9">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Brin]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Davis]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Garcia-Molina]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Copy detection mechanisms for digital documents]]></article-title>
<source><![CDATA[Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD 1995)]]></source>
<year>1995</year>
<page-range>398-409</page-range></nlm-citation>
</ref>
<ref id="B10">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Henzinger]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Finding near-duplicate web pages]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Efthimiadis]]></surname>
<given-names><![CDATA[E. N.]]></given-names>
</name>
<name>
<surname><![CDATA[Dumais]]></surname>
<given-names><![CDATA[S. T.]]></given-names>
</name>
<name>
<surname><![CDATA[Hawking]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Kalervo]]></surname>
<given-names><![CDATA[J. e.]]></given-names>
</name>
</person-group>
<source><![CDATA[Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06]]></source>
<year>2006</year>
<page-range>284</page-range><publisher-name><![CDATA[ACM]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B11">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bernstein]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Shokouhi]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Zobel]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Compact features for detection of near-duplicates in distributed retrieval]]></article-title>
<source><![CDATA[Proceedings of the Symposium on String Processing and Information Retrieval]]></source>
<year>2006</year>
<page-range>110-121</page-range></nlm-citation>
</ref>
<ref id="B12">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gaizauskas]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Foster]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Wilks]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Arundel]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Clough]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Piao]]></surname>
<given-names><![CDATA[S. S. L.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[The METER corpus: a corpus for analysing journalistic text reuse]]></article-title>
<source><![CDATA[Proceedings of the 2001 Corpus Linguistics Conference]]></source>
<year>2001</year>
<page-range>214-223</page-range></nlm-citation>
</ref>
<ref id="B13">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Yang]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Next steps in near-duplicate detection for eRulemaking]]></article-title>
<source><![CDATA[Proceedings of the 7th Annual International Conference on Digital Government Research (DG.O 2006)]]></source>
<year>2006</year>
<page-range>239-248</page-range></nlm-citation>
</ref>
<ref id="B14">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Potthast]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Stein]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Rosso]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[An evaluation framework for plagiarism detection]]></article-title>
<source><![CDATA[Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010]]></source>
<year>2010</year>
</nlm-citation>
</ref>
<ref id="B15">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Church]]></surname>
<given-names><![CDATA[K. W.]]></given-names>
</name>
<name>
<surname><![CDATA[Gale]]></surname>
<given-names><![CDATA[W. A.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Inverse document frequency (IDF): A measure of deviations from Poisson]]></article-title>
<source><![CDATA[Proceedings of the Third Workshop on Very Large Corpora]]></source>
<year>1995</year>
<page-range>121-130</page-range></nlm-citation>
</ref>
<ref id="B16">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Fourour]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Morin]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Daille]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Incremental recognition and referential categorization of French proper names]]></article-title>
<source><![CDATA[Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)]]></source>
<year>2002</year>
<volume>3</volume>
<page-range>1068-1074</page-range></nlm-citation>
</ref>
<ref id="B17">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Cerbah]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Exogeneous and endogeneous approaches to semantic categorization of unknown technical terms]]></article-title>
<source><![CDATA[Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000)]]></source>
<year>2000</year>
<page-range>145-151</page-range></nlm-citation>
</ref>
<ref id="B18">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Daille]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Conceptual structuring through term variations]]></article-title>
<source><![CDATA[Proceedings ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment]]></source>
<year>2003</year>
<page-range>9-16</page-range></nlm-citation>
</ref>
<ref id="B19">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Abeille]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Clement]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Toussenel]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
</person-group>
<source><![CDATA[Building a tree bank for French]]></source>
<year>2003</year>
<page-range>165-187</page-range><publisher-name><![CDATA[Kluwer Academic Publishers]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B20">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Hoad]]></surname>
<given-names><![CDATA[T. C.]]></given-names>
</name>
<name>
<surname><![CDATA[Zobel]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Methods for identifying versioned and plagiarised documents]]></article-title>
<source><![CDATA[Journal of the American Society for Information Science and Technology]]></source>
<year>2002</year>
<volume>54</volume>
<page-range>203-215</page-range></nlm-citation>
</ref>
<ref id="B21">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Metzler]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Bernstein]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Croft]]></surname>
<given-names><![CDATA[B. W.]]></given-names>
</name>
<name>
<surname><![CDATA[Moffat]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Zobel]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Similarity measures for tracking information flow]]></article-title>
<source><![CDATA[CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management]]></source>
<year>2005</year>
<page-range>517-524</page-range><publisher-loc><![CDATA[New York, NY, USA]]></publisher-loc>
<publisher-name><![CDATA[ACM]]></publisher-name>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
