SciELO - Scientific Electronic Library Online

 


Polibits

On-line version ISSN 1870-9044

Polibits no. 40, Mexico, Jul./Dec. 2009

 

Special section: Information Retrieval and Natural Language Processing

 

TrainQA: a Training Corpus for Corpus–Based Question Answering Systems

 

David Tomás1, José L. Vicedo1, Empar Bisbal2, and Lidia Moreno2

 

1 Department of Software and Computing Systems, University of Alicante, Spain. (dtomas@dlsi.ua.es, vicedo@dlsi.ua.es)

2 Department of Information Systems and Computation, Technical University of Valencia, Spain. (ebisbal@dsic.upv.es, lmoreno@dsic.upv.es)

 

Manuscript received November 23, 2008.
Manuscript accepted for publication August 15, 2009.

 

Abstract

This paper describes the development of an English corpus of factoid TREC–like question–answer pairs. The resulting corpus consists of more than 70,000 samples, each containing the following information: a question, its question type, an exact answer to the question, the different context levels (sentence, paragraph and document) where the answer occurs inside a document, and a label indicating whether the answer is correct (a positive sample) or not (a negative sample). For instance, TrainQA can be used to train a binary classifier that decides whether a given answer to a question is correct (positive) or not (negative). To our knowledge, this is the first corpus aimed at training every stage of a trainable Question Answering system: question classification, information retrieval, answer extraction and answer validation.
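As an illustration only, the sample structure listed in the abstract could be represented as follows. This is a minimal sketch, not the actual TrainQA file format: the class name `QASample`, its field names, the helper `to_validation_pairs`, the `[SEP]` separator and the two example records are all hypothetical and were invented for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class QASample:
    """One corpus sample; field names are assumptions, not the TrainQA schema."""
    question: str
    question_type: str
    answer: str
    sentence: str    # sentence-level context containing the answer
    paragraph: str   # paragraph-level context
    document: str    # document-level context
    is_correct: bool  # True = positive sample, False = negative sample


def to_validation_pairs(samples: List[QASample]) -> List[Tuple[str, int]]:
    """Turn samples into (text, label) pairs for a binary answer validator."""
    pairs = []
    for s in samples:
        # Join question, candidate answer and sentence context into one input.
        text = f"{s.question} [SEP] {s.answer} [SEP] {s.sentence}"
        pairs.append((text, 1 if s.is_correct else 0))
    return pairs


# Invented records for illustration; they are not taken from the corpus.
samples = [
    QASample("Who wrote Hamlet?", "PERSON", "William Shakespeare",
             "Hamlet was written by William Shakespeare.",
             "Hamlet was written by William Shakespeare. It is a tragedy.",
             "doc-1", True),
    QASample("Who wrote Hamlet?", "PERSON", "Christopher Marlowe",
             "Christopher Marlowe was a contemporary playwright.",
             "Christopher Marlowe was a contemporary playwright.",
             "doc-2", False),
]

pairs = to_validation_pairs(samples)
labels = [label for _, label in pairs]
```

The resulting `(text, label)` pairs could then be passed to any binary classifier for the answer validation stage. The `question_type` field would similarly serve as the target for question classification.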

Key words: Question answering, corpus–based systems.

 


 

ACKNOWLEDGEMENT

This work has been developed in the framework of the project CICYT R2D2 (TIC2003–07158–C04).

 


Creative Commons License: All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.