<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1405-5546</journal-id>
<journal-title><![CDATA[Computación y Sistemas]]></journal-title>
<abbrev-journal-title><![CDATA[Comp. y Sist.]]></abbrev-journal-title>
<issn>1405-5546</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1405-55462016000200279</article-id>
<article-id pub-id-type="doi">10.13053/cys-20-2-2369</article-id>
<title-group>
<article-title xml:lang="es"><![CDATA[Detección automática de similitud entre programas del lenguaje de programación Karel basada en técnicas de procesamiento de lenguaje natural]]></article-title>
<article-title xml:lang="en"><![CDATA[Automatic Detection of Similarity of Programs in Karel Programming Language based on Natural Language Processing Techniques]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Sidorov]]></surname>
<given-names><![CDATA[Grigori]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Ibarra Romero]]></surname>
<given-names><![CDATA[Martín]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Markov]]></surname>
<given-names><![CDATA[Ilia]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Guzman-Cabrera]]></surname>
<given-names><![CDATA[Rafael]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Chanona-Hernández]]></surname>
<given-names><![CDATA[Liliana]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Velásquez]]></surname>
<given-names><![CDATA[Francisco]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
</contrib-group>
<aff id="Af1">
<institution><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></institution>
<addr-line><![CDATA[Mexico ]]></addr-line>
<country>Mexico</country>
</aff>
<aff id="Af2">
<institution><![CDATA[Universidad de Guanajuato, División de Ingenierías]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
<country>Mexico</country>
</aff>
<aff id="Af3">
<institution><![CDATA[Instituto Politécnico Nacional, Escuela Superior de Ingeniería Mecánica y Eléctrica]]></institution>
<addr-line><![CDATA[Mexico ]]></addr-line>
<country>Mexico</country>
</aff>
<aff id="Af4">
<institution><![CDATA[Universidad Politécnica de Querétaro]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
<country>Mexico</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>06</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>06</month>
<year>2016</year>
</pub-date>
<volume>20</volume>
<numero>2</numero>
<fpage>279</fpage>
<lpage>288</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1405-55462016000200279&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1405-55462016000200279&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1405-55462016000200279&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="es"><p><![CDATA[Este artículo presenta un método para calcular la similitud entre programas (código fuente). La tarea es útil, por ejemplo, para la clasificación temática de programas o detección de reuso de código (digamos, en el caso de plagio). Usamos para los experimentos el lenguaje de programación Karel. Para determinar la similitud entre programas y/o ideas de soluciones similares utilizamos un enfoque basado en técnicas de procesamiento de lenguaje natural y de recuperación de información. Estas técnicas usan la representación de un documento como un vector de valores de características. Usualmente, las características son n-gramas de palabras o de caracteres. Posteriormente, se puede aplicar el análisis semántico latente para reducir la dimensionalidad de este espacio vectorial. Finalmente, se usa el aprendizaje automático supervisado para la clasificación de textos (o programas que son textos también) parecidos. Para validar el método propuesto, se compiló un corpus de programas para 100 tareas diferentes con un total de 9,341 códigos y otro corpus para 34 tareas adicionalmente clasificado por la idea de solución, formado por 374 códigos. Los resultados experimentales muestran que para el corpus con ideas de solución es mejor la representación con trigramas de caracteres, mientras que para el corpus completo los mejores resultados se obtienen con trigramas de términos y la aplicación del análisis semántico latente.]]></p></abstract>
<abstract abstract-type="short" xml:lang="en"><p><![CDATA[In this paper, we present a method for calculating the similarity between programs (source code). The task is useful, for example, for thematic classification of programs or for detecting code reuse, as in the case of plagiarism. We use the Karel programming language for the experiments. To determine the similarity between Karel programs and/or between similar solution ideas, we apply techniques from natural language processing and information retrieval. These techniques represent a document as a vector of feature values. Usually, the features are word n-grams or character n-grams. Latent semantic analysis can then be applied to reduce the dimensionality of the vector space. Finally, we use supervised machine learning to classify similar texts (or programs, which are texts as well). To evaluate the proposed method, we compiled two corpora: the first contains programs for 100 different tasks, with a total of 9,341 source codes; the second covers 34 tasks, with a total of 374 source codes additionally classified by solution idea. Our experiments show that for the first corpus the best results are obtained using term (word) trigrams together with latent semantic analysis, while for the second corpus the best representation is character trigrams.]]></p></abstract>
<kwd-group>
<kwd lng="es"><![CDATA[Similitud]]></kwd>
<kwd lng="es"><![CDATA[n-gramas]]></kwd>
<kwd lng="es"><![CDATA[programa]]></kwd>
<kwd lng="es"><![CDATA[código fuente]]></kwd>
<kwd lng="es"><![CDATA[análisis semántico latente]]></kwd>
<kwd lng="es"><![CDATA[recuperación de información]]></kwd>
<kwd lng="es"><![CDATA[procesamiento de lenguaje natural]]></kwd>
<kwd lng="en"><![CDATA[Similarity]]></kwd>
<kwd lng="en"><![CDATA[n-grams]]></kwd>
<kwd lng="en"><![CDATA[program]]></kwd>
<kwd lng="en"><![CDATA[source code]]></kwd>
<kwd lng="en"><![CDATA[latent semantic analysis]]></kwd>
<kwd lng="en"><![CDATA[information retrieval]]></kwd>
<kwd lng="en"><![CDATA[natural language processing]]></kwd>
</kwd-group>
</article-meta>
</front><back>
<ref-list>
<ref id="B1">
<nlm-citation citation-type="journal">
<article-title xml:lang=""><![CDATA[An approach to source-code plagiarism detection and investigation using latent semantic analysis]]></article-title>
<person-group person-group-type="author">
<name>
<surname><![CDATA[Cosma]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
</person-group>
<source><![CDATA[IEEE Transactions on Computers]]></source>
<year>2012</year>
<volume>61</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>379-94</page-range></nlm-citation>
</ref>
<ref id="B2">
<nlm-citation citation-type="journal">
<article-title xml:lang=""><![CDATA[Indexing by latent semantic analysis]]></article-title>
<person-group person-group-type="author">
<name>
<surname><![CDATA[Deerwester]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Dumais]]></surname>
<given-names><![CDATA[S. T.]]></given-names>
</name>
<name>
<surname><![CDATA[Furnas]]></surname>
<given-names><![CDATA[G. W.]]></given-names>
</name>
<name>
<surname><![CDATA[Landauer]]></surname>
<given-names><![CDATA[T. K.]]></given-names>
</name>
<name>
<surname><![CDATA[Harshman]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<source><![CDATA[Journal of the American Society for Information Science]]></source>
<year>1990</year>
<volume>41</volume>
<numero>6</numero>
<issue>6</issue>
<page-range>391-407</page-range></nlm-citation>
</ref>
<ref id="B3">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Flores]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Ibarra]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Moreno]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Sidorov]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Rosso]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
</person-group>
<source><![CDATA[Modelos de recuperación de información basados en n-gramas aplicados a la reutilización de código fuente]]></source>
<year>2014</year>
<conf-name><![CDATA[3rd Spanish Conference on Information Retrieval, CERI '14]]></conf-name>
<conf-loc> </conf-loc>
<page-range>185-8</page-range></nlm-citation>
</ref>
<ref id="B4">
<nlm-citation citation-type="journal">
<article-title xml:lang=""><![CDATA[Sim: A utility for detecting similarity in computer programs]]></article-title>
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gitchell]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Tran]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
</person-group>
<source><![CDATA[SIGCSE Bull.]]></source>
<year>1999</year>
<volume>31</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>266-70</page-range></nlm-citation>
</ref>
<ref id="B5">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gómez-Adorno]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Sidorov]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Pinto]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Markov]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
</person-group>
<source><![CDATA[A graph based authorship identification approach. Working Notes Papers of the CLEF 2015 Evaluation Labs]]></source>
<year>2015</year>
<volume>1391</volume>
<conf-name><![CDATA[ CLEF'15]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B6">
<nlm-citation citation-type="journal">
<article-title xml:lang=""><![CDATA[The WEKA data mining software: An update]]></article-title>
<person-group person-group-type="author">
<name>
<surname><![CDATA[Hall]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Frank]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Holmes]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Pfahringer]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Reutemann]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Witten]]></surname>
<given-names><![CDATA[I. H.]]></given-names>
</name>
</person-group>
<source><![CDATA[SIGKDD Explorations]]></source>
<year>2009</year>
<volume>11</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>10-8</page-range></nlm-citation>
</ref>
<ref id="B7">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Halstead]]></surname>
<given-names><![CDATA[M. H.]]></given-names>
</name>
</person-group>
<source><![CDATA[Elements of Software Science]]></source>
<year>1977</year>
<publisher-loc><![CDATA[New York, NY, USA ]]></publisher-loc>
<publisher-name><![CDATA[Elsevier Science Inc.]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B8">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Hsu]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Lin]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<source><![CDATA[A block-structured model for source code retrieval]]></source>
<year>2011</year>
<conf-name><![CDATA[ Intelligent Information and Database Systems: Third International Conference, ACIIDS 2011]]></conf-name>
<conf-loc> </conf-loc>
<page-range>161-70</page-range><publisher-loc><![CDATA[Berlin, Heidelberg]]></publisher-loc>
<publisher-name><![CDATA[Springer]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B9">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Manning]]></surname>
<given-names><![CDATA[C. D.]]></given-names>
</name>
<name>
<surname><![CDATA[Raghavan]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Schütze]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<source><![CDATA[Introduction to Information Retrieval]]></source>
<year>2008</year>
<publisher-loc><![CDATA[New York, NY, USA ]]></publisher-loc>
<publisher-name><![CDATA[Cambridge University Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B10">
<nlm-citation citation-type="journal">
<article-title xml:lang=""><![CDATA[A complexity measure]]></article-title>
<person-group person-group-type="author">
<name>
<surname><![CDATA[McCabe]]></surname>
<given-names><![CDATA[T. J.]]></given-names>
</name>
</person-group>
<source><![CDATA[IEEE Transactions on Software Engineering]]></source>
<year>1976</year>
<volume>2</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>308-20</page-range></nlm-citation>
</ref>
<ref id="B11">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pattis]]></surname>
<given-names><![CDATA[R. E.]]></given-names>
</name>
<name>
<surname><![CDATA[Roberts]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Stehlik]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Karel the Robot: A Gentle Introduction to the Art of Programming]]></source>
<year>1994</year>
<edition>2</edition>
<publisher-loc><![CDATA[New York, NY, USA ]]></publisher-loc>
<publisher-name><![CDATA[John Wiley & Sons, Inc.]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B12">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Posadas-Durán]]></surname>
<given-names><![CDATA[J. P.]]></given-names>
</name>
<name>
<surname><![CDATA[Markov]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
<name>
<surname><![CDATA[Gómez-Adorno]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Sidorov]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Batyrshin]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
<name>
<surname><![CDATA[Gelbukh]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Pichardo-Lagunas]]></surname>
<given-names><![CDATA[O.]]></given-names>
</name>
</person-group>
<source><![CDATA[Syntactic n-grams as features for the author profiling task. Working Notes Papers of the CLEF 2015 Evaluation Labs]]></source>
<year>2015</year>
<volume>1391</volume>
<conf-name><![CDATA[ CLEF'15]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B13">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Schleimer]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Wilkerson]]></surname>
<given-names><![CDATA[D. S.]]></given-names>
</name>
<name>
<surname><![CDATA[Aiken]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Winnowing: Local algorithms for document fingerprinting]]></source>
<year>2003</year>
<conf-name><![CDATA[ 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03]]></conf-name>
<conf-loc>New York, NY, USA </conf-loc>
<page-range>76-85</page-range></nlm-citation>
</ref>
<ref id="B14">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sidorov]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
</person-group>
<source><![CDATA[Construcción no lineal de n-gramas en la lingüística computacional: n-gramas sintácticos, filtrados y generalizados]]></source>
<year>2013</year>
<publisher-loc><![CDATA[México ]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B15">
<nlm-citation citation-type="journal">
<article-title xml:lang=""><![CDATA[Soft similarity and soft cosine measure: Similarity of features in vector space model]]></article-title>
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sidorov]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Gelbukh]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Gómez-Adorno]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Pinto]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Computación y Sistemas]]></source>
<year>2014</year>
<volume>18</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>491-504</page-range></nlm-citation>
</ref>
<ref id="B16">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sidorov]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Gómez-Adorno]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Markov]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
<name>
<surname><![CDATA[Pinto]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Loya]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
</person-group>
<source><![CDATA[Computing text similarity using tree edit distance]]></source>
<year>2015</year>
<conf-name><![CDATA[5th World Conference on Soft Computing (WConSC), 2015 Annual Conference of the North American Fuzzy Information Processing Society, NAFIPS '15]]></conf-name>
<conf-loc> </conf-loc>
<page-range>1-4</page-range></nlm-citation>
</ref>
<ref id="B17">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Wise]]></surname>
<given-names><![CDATA[M. J.]]></given-names>
</name>
</person-group>
<source><![CDATA[YAP3: Improved detection of similarities in computer program and other texts]]></source>
<year>1996</year>
<conf-name><![CDATA[Twenty-seventh SIGCSE Technical Symposium on Computer Science Education, SIGCSE '96]]></conf-name>
<conf-loc>New York, NY, USA </conf-loc>
<page-range>130-4</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
