
Computación y Sistemas

Print version ISSN 1405-5546

Comp. y Sist. vol.18 n.3 México Jul./Sep. 2014

http://dx.doi.org/10.13053/CyS-18-3-2040 

Regular articles

 

Paraphrase and Textual Entailment Generation in Czech

 

Zuzana Nevěřilová

 

Natural Language Processing Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic. xpopelk@fi.muni.cz.

 

Article received on 07/01/2014.
Accepted on 01/02/2014.

 

Abstract

Paraphrase and textual entailment generation can support natural language processing (NLP) tasks that simulate text understanding, e.g., text summarization, plagiarism detection, or question answering. A paraphrase, i.e., a sentence with the same meaning, conveys a given piece of information with new words and new syntactic structures. Textual entailment, i.e., an inference that humans will judge most likely true, can employ real-world knowledge to make implicit information explicit. Paraphrases can also be seen as mutual entailments. We present a new system that generates paraphrases and textual entailments from a given text in the Czech language. The process has two stages. The first is rule-based: the system analyzes the input text, produces its inner representation, transforms it according to transformation rules, and generates new sentences. In the second, the generated sentences are ranked according to a statistical model and only the best ones are output. The decision whether a paraphrase or textual entailment is correct is left to humans. For this purpose we designed an annotation game based on a conversation between a detective (the human player) and his assistant (the system). The result of this annotation is a collection of annotated text-hypothesis pairs. Currently, the system and the game are intended to collect data in the Czech language; however, the idea can be applied to other languages. So far, we have collected 3,321 H-T pairs. Of these, 1,563 (47.06 %) were judged correct, 1,238 (37.28 %) were judged incorrect entailments, and 520 (15.66 %) were judged nonsense or unknown.

Keywords: Games with a purpose, paraphrase, textual entailment, natural language generation.
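
The two-stage process described in the abstract (rule-based generation followed by statistical ranking) can be illustrated with a minimal sketch. This is not the system presented in the article: the rules, the toy English corpus, the unigram scoring model, and the names `RULES`, `generate`, `score`, and `best_candidates` are all assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical transformation rules (not taken from the article). Each rule
# rewrites one phrase and yields a paraphrase or entailment candidate.
RULES = [
    ("bought", "purchased"),  # synonym substitution (paraphrase-like)
    ("a sedan", "a car"),     # hypernym substitution (entailment-like)
    ("bought", "xyzzed"),     # deliberately bad rule, to show what ranking filters
]

# Tiny stand-in corpus for the statistical model (an assumption).
CORPUS = "the man purchased a car . the woman bought a car . he bought a sedan".split()
COUNTS = Counter(CORPUS)
TOTAL = sum(COUNTS.values())
VOCAB = len(COUNTS) + 1  # one extra slot reserves probability mass for unseen words


def generate(text):
    """Rule-based stage: apply each matching rule on its own to the input."""
    return [text.replace(src, dst) for src, dst in RULES if src in text]


def score(sentence):
    """Average add-one-smoothed unigram log-probability per token."""
    tokens = sentence.split()
    logp = sum(math.log((COUNTS[t] + 1) / (TOTAL + VOCAB)) for t in tokens)
    return logp / len(tokens)


def best_candidates(text, k=2):
    """Ranking stage: keep only the k highest-scoring generated sentences."""
    return sorted(generate(text), key=score, reverse=True)[:k]


print(best_candidates("the man bought a sedan"))
```

With this toy corpus, the candidate containing the unseen word `xyzzed` scores lowest and is dropped, while the two plausible candidates are kept. Whether a kept candidate is actually a correct paraphrase or entailment is still a human judgment, which is the role the annotation game plays in the article.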

 


 

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Ministry of the Interior of CR within the project VF20102014003.

Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Infrastructure for Research, Development, and Innovations" (LM2010005), is appreciated.

 


Creative Commons License: All contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.