Dependency vs. Constituent Based Syntactic N-Grams in Text Similarity Measures for Paraphrase Recognition

Calvo, Hiram; Segura-Olivares, Andrea; García, Alejandro

doi:10.13053/CyS-18-3-2044

Computación y Sistemas

On-line version ISSN 2007-9737
Print version ISSN 1405-5546

Comp. y Sist. vol.18 n.3 Ciudad de México Jul./Sep. 2014

https://doi.org/10.13053/CyS-18-3-2044

Regular Articles

Hiram Calvo, Andrea Segura-Olivares, and Alejandro García

Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Mexico City, Mexico. hcalvo@cic.ipn.mx, msegura_b12@sagitario.cic.ipn.mx, igarcia_b12@sagitario.cic.ipn.mx.

Article received on 04/02/2014.
Accepted on 17/03/2014.

Abstract

Paraphrase recognition consists of detecting whether an expression restated as another expression conveys the same information. Traditionally, this problem has been addressed with a variety of lexical, syntactic, and semantic techniques. To measure word overlap, most works use n-grams; syntactic n-grams, however, have been scarcely explored. We propose using syntactic dependency and constituent n-grams combined with common NLP techniques such as stemming, synonym detection, similarity measures, and linear combination, together with a similarity matrix built in turn from syntactic n-grams. We measure and compare the performance of our system using the Microsoft Research Paraphrase Corpus. We present an in-depth study of the strengths and weaknesses of each approach, as well as an analysis of common errors. Our main motivation was to determine which syntactic approach performs better for this task: syntactic dependency n-grams or syntactic constituent n-grams. We also compare both approaches with traditional n-grams and state-of-the-art systems.

Keywords: Paraphrase recognition, Microsoft Research paraphrase corpus, similarity measures, syntactic n-grams, constituent analysis, dependency analysis.
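
To make the contrast drawn in the abstract concrete, the following minimal sketch compares traditional word n-grams with syntactic dependency n-grams on a toy paraphrase pair. It is an illustration only and not the system described in the article: the dependency heads are assigned by hand rather than produced by a parser, the Jaccard coefficient is an assumed stand-in for the similarity measures the authors evaluate, and the function names `word_ngrams`, `dependency_ngrams`, and `jaccard` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Traditional n-grams: windows of n adjacent tokens in surface order."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def dependency_ngrams(tokens: List[str], heads: List[int], n: int) -> Set[NGram]:
    """Syntactic n-grams: sequences of n words along head -> dependent arcs.

    heads[i] is the index of the head of token i, or -1 for the root.
    """
    children: Dict[int, List[int]] = {}
    for dep, head in enumerate(heads):
        if head >= 0:
            children.setdefault(head, []).append(dep)

    def paths(node: int, length: int) -> List[List[int]]:
        # All downward paths of exactly `length` nodes starting at `node`.
        if length == 1:
            return [[node]]
        return [[node] + rest
                for child in children.get(node, [])
                for rest in paths(child, length - 1)]

    return {tuple(tokens[i] for i in path)
            for start in range(len(tokens))
            for path in paths(start, n)}


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard overlap (an assumed measure, chosen here for simplicity)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy paraphrase pair; the heads below are hand-assigned assumptions.
s1 = ["john", "quickly", "ate", "the", "apple"]
h1 = [2, 2, -1, 4, 2]
s2 = ["john", "ate", "the", "apple", "quickly"]
h2 = [1, -1, 3, 1, 1]

print(round(jaccard(word_ngrams(s1, 2), word_ngrams(s2, 2)), 3))  # 0.333
print(jaccard(dependency_ngrams(s1, h1, 2), dependency_ngrams(s2, h2, 2)))  # 1.0
```

In this toy pair, moving the adverb changes the surface bigrams but leaves the head-dependent arcs unchanged, so the dependency bigram sets coincide while the word bigram sets overlap only partially. Real sentence pairs, parser errors, and the constituent-based variant discussed in the article are outside the scope of this sketch.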

Acknowledgements

This work was supported by CONACyT-SNI, SIP-IPN, COFAA-IPN, and PIFI-IPN.
