SIMTEX: An Approach for Detecting and Measuring Textual Similarity based on Discourse and Semantics

da Cunha, Iria; Vivaldi, Jorge; Torres-Moreno, Juan-Manuel; Sierra, Gerardo

doi:10.13053/CyS-18-3-2033

Computación y Sistemas

On-line version ISSN 2007-9737

Print version ISSN 1405-5546

Comp. y Sist. vol.18 n.3 Ciudad de México Jul./Sep. 2014

https://doi.org/10.13053/CyS-18-3-2033

Regular articles

Iria da Cunha¹, Jorge Vivaldi¹, Juan-Manuel Torres-Moreno²,³, and Gerardo Sierra²,⁴

¹ University Institute for Applied Linguistics (Universitat Pompeu Fabra), Barcelona, Spain. iria.dacunha@upf.edu, jorge.vivaldi@upf.edu

² LIA/Agorantic/Université d'Avignon et des Pays de Vaucluse, Avignon, France. juan-manuel.torres@univ-avignon.fr

³ École Polytechnique de Montréal, Montréal (Québec), Canada.

⁴ Universidad Nacional Autónoma de México/Instituto de Ingeniería, Mexico DF, México. GSierraM@ii.unam.mx

Article received on 20/01/2014.
Accepted on 21/03/2014.

Abstract

Automatic systems for detecting and measuring textual similarity are currently being developed for application to different tasks in the field of Natural Language Processing (NLP). These systems use surface linguistic features or statistical information, and few researchers use deep linguistic information. In this work, we present an algorithm for detecting and measuring textual similarity that takes into account the information offered by the discourse relations of Rhetorical Structure Theory (RST) and the lexical-semantic relations included in EuroWordNet. We apply the algorithm, called SIMTEX, to texts written in Spanish, but the methodology is potentially language-independent.

Keywords: Textual similarity, discourse, semantics, paraphrase.

Acknowledgments

We acknowledge Mexico's National Council of Science and Technology (Conacyt), grant number 178248, and Project UNAM-DGAPA-PAPIIT number IN400312. We also acknowledge the support of the Spanish projects RICOTERM 4 (FFI2010-21365-C03-01) and APLE 2 (FFI2012-37260), a Juan de la Cierva grant (JCI-2011-09665), and an Ibero-America Young Teachers and Researchers Santander Grant 2013.
