TrainQA: a Training Corpus for Corpus-Based Question Answering Systems

Tomás, David; Vicedo, José L.; Bisbal, Empar; Moreno, Lidia


Polibits

On-line version ISSN 1870-9044

Polibits n.40 México Jul./Dec. 2009

Special section: Information Retrieval and Natural Language Processing

David Tomás¹, José L. Vicedo¹, Empar Bisbal², and Lidia Moreno²

¹ Department of Software and Computing Systems, University of Alicante, Spain. (dtomas@dlsi.ua.es, vicedo@dlsi.ua.es)

² Department of Information Systems and Computation, Technical University of Valencia, Spain. (ebisbal@dsic.upv.es, lmoreno@dsic.upv.es)

Manuscript received November 23, 2008.
Manuscript accepted for publication August 15, 2009.

Abstract

This paper describes the development of an English corpus of factoid, TREC-like question-answer pairs. The resulting corpus consists of more than 70,000 samples, each containing the following information: a question, its question type, an exact answer to the question, the context levels (sentence, paragraph, and document) in which the answer occurs within a document, and a label indicating whether the answer is correct (a positive sample) or not (a negative sample). For instance, TrainQA can be used to train a binary classifier that decides whether a given answer to a question is correct (positive) or not (negative). To our knowledge, this is the first corpus designed for training every stage of a trainable Question Answering system: question classification, information retrieval, answer extraction, and answer validation.

Key words: Question answering, corpus-based systems.
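
The sample layout described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the class name `QASample`, its field names, the `PERSON` type label, the `[SEP]` separator, and the two example records are all hypothetical and are not taken from TrainQA or its actual file format. The sketch shows how samples carrying a correct/incorrect label could be turned into (text, label) pairs for a binary answer-validation classifier.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class QASample:
    """One question-answer sample. Field names are hypothetical,
    chosen to mirror the items listed in the abstract."""
    question: str
    question_type: str
    answer: str
    sentence: str      # sentence-level context containing the answer
    paragraph: str     # paragraph-level context
    document: str      # document-level context
    is_correct: bool   # True = positive sample, False = negative sample


def to_validation_pairs(samples: List[QASample]) -> List[Tuple[str, int]]:
    """Build (text, label) pairs for a binary answer-validation classifier.

    The text joins question, candidate answer, and sentence context with a
    separator token (an assumption of this sketch); the label is 1 for a
    positive sample and 0 for a negative one.
    """
    pairs = []
    for s in samples:
        text = f"{s.question} [SEP] {s.answer} [SEP] {s.sentence}"
        pairs.append((text, 1 if s.is_correct else 0))
    return pairs


# Invented records for illustration only; they do not come from the corpus.
samples = [
    QASample(
        question="Who wrote Hamlet?",
        question_type="PERSON",
        answer="William Shakespeare",
        sentence="Hamlet was written by William Shakespeare.",
        paragraph="(paragraph text)",
        document="(document text)",
        is_correct=True,
    ),
    QASample(
        question="Who wrote Hamlet?",
        question_type="PERSON",
        answer="Christopher Marlowe",
        sentence="Christopher Marlowe wrote Doctor Faustus.",
        paragraph="(paragraph text)",
        document="(document text)",
        is_correct=False,
    ),
]

pairs = to_validation_pairs(samples)
for text, label in pairs:
    print(label, text)
```

The same records could feed the other stages named in the abstract, for example using `question` and `question_type` alone as training data for question classification.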

ACKNOWLEDGEMENT

This work was developed within the framework of the CICYT R2D2 project (TIC2003-07158-C04).
