ISSN 1405-5546

Article ID: S1405-55462009000300004

México, September 2009

Vol. 13, No. 1, pp. 33–44

Articles

Using Machine Learning for Extracting Information from Natural Disaster News Reports

Alberto Téllez Valero, Manuel Montes y Gómez and Luis Villaseñor Pineda

Laboratorio de Tecnologías del Lenguaje, Coordinación de Ciencias Computacionales, Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE). Luis Enrique Erro #1, Tonantzintla, Puebla, México; albertotellezv@ccc.inaoep.mx, mmontesg@ccc.inaoep.mx, villasen@ccc.inaoep.mx

Article received on July 17, 2008
Accepted on April 03, 2009

Abstract

Disasters caused by natural phenomena have occurred throughout human history; nevertheless, their consequences grow more severe over time. This tendency is not expected to reverse in the coming years; on the contrary, natural phenomena are expected to increase in number and intensity due to global warming. It is therefore of great interest to have sufficient data on natural disasters, since these data are necessary to analyze their impact and to establish links between their occurrence and their effects. To address this need, this paper describes a system based on Machine Learning methods that improves the acquisition of natural disaster data. The system automatically populates a natural disaster database by extracting information from online news reports. In particular, it extracts information about five types of natural disasters: hurricanes, earthquakes, forest fires, floods, and droughts. Experimental results on a collection of Spanish news reports show the effectiveness of the proposed system both for detecting relevant documents about natural disasters (reaching an F-measure of 98%) and for extracting relevant facts to be inserted into a given database (reaching an F-measure of 76%).

Keywords: Machine Learning, Information Extraction, Text Categorization, Natural Disasters, Databases.

Acknowledgments

This work was partially supported by Conacyt through research grants (CB-61335, CB-82050 and CB-83459) and a scholarship (171610).
